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1.  INTRODUCTION 

Through  vision,  we  derive  a  rich  understanding  of  what  is  in  the  world,  where  objects 
are  located,  and  how  they  axe  changing  with  time.  Because  we  obtain  this  understand¬ 
ing  immediately,  effortlessly,  and  without  conscious  introspection,  we  can  be  deceived  into 
thinking  that  vision  should  therefore  be  a  fairly  simple  task  to  perform.  The  computational 
approach  to  the  study  of  vision  inquires  directly  into  the  sort  of  information  processing 
needed  to  extract  important  information  from  the  changing  visual  image  —  information 
such  as  the  three-dimensional  (3-D)  structure  and  movement  of  objects  in  the  scene,  or  the 
color  and  texture  of  object  surfaces.  An  important  contribution  that  computational  studies 
have  made  is  to  show  how  difficult  vision  is  to  perform,  and  how  complex  are  the  processes 
needed  to  perform  visual  tasks  successfully. 

Levels  of  analysis 

The  development  of  computers  with  increasing  power  and  sophistication  has  often  stim¬ 
ulated  comparisons  between  computers  and  the  human  brain,  especially  since  computers 
have  been  applied  more  and  more  to  tasks  that  were  formerly  considered  uniquely  human 
capabilities,  such  as  understanding  natural  language.  It  is  clear,  however,  that  at  the  level 
of  their  hardware,  neurons  and  computer  circuits  are  very  different.  We  can  nevertheless 
attempt  to  describe  the  processes  that  take  place  in  these  two  systems  at  a  level  that  is  es¬ 
sentially  independent  of  this  hardware  —  this  is  a  description  of  the  tasks  that  they  perform. 
In  much  the  same  way  that  we  can  describe  the  theory  of  arithmetic  independent  of  the 
computing  device  that  carries  out  the  arithmetic  operations,  we  can  describe  the  theory  of 
vision  independent  of  the  hardware  that  carries  it  out,  whether  it  be  biological  or  computer 
hardware. 

This  idea  of  separating  the  tasks  performed  by  a  vision  system  from  the  hardware 
that  carries  out  these  tasks  was  central  to  the  work  of  David  Marr  (1982).  Marr  argued 
that  there  are  at  least  three  different  levels  at  which  problems  in  vision  can  be  described, 
which  he  labelled  the  computational  theory,  algorithm,  and  mechanism.  Theoretical  issues 
include  an  analysis  of  the  way  in  which  properties  of  the  physical  world  constrain  how 
problems  in  vision  are  solved.  An  algorithm  is  a  step-by-step  procedure  that  transforms 
one  representation  of  visual  information  into  the  next.  Finally,  the  mechanism  refers  to  the 
details  of  how  a  computation  is  carried  out  in  neural  or  computer  hardware.  These  three 
levels  of  analysis  are  a  useful  methodological  distinction  for  studying  visual  processing,  but 
the  distinction  between  levels  such  as  algorithm  and  mechanism  is  often  not  a  snarp  one  for 
the  brain. 

An  aspect  of  computational  studies  that  distinguishes  them  from  other  theoretical  ap¬ 
proaches  to  the  study  of  vision  is  the  notion  that  an  effective  way  to  analyze  the  compu¬ 
tational  strategies  used  by  biological  systems  to  perform  complex  visual  tasks  is  to  build 
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computer  vision  systems  that  use  similar  strategies.  The  synthesis  of  machine  vision  sys¬ 
tems  allows  rigorous  testing  of  whether  some  strategy  that  is  hypothesized  for  the  biological 
system  really  works  at  solving  an  important  problem  in  vision.  Building  machine  vision 
systems  often  uncovers  difficult  aspects  of  a  problem,  and  possible  solutions,  that  were  not 
realized  upon  first  consideration. 

Experimental  studies  from  psychology  and  the  neurosciences  provide  critical  insights 
into  the  particular  computations  that  underlie  visual  processing  in  biological  systems.  A 
given  problem  typically  can  be  solved  in  more  than  one  way;  there  are  different  choices  for 
how  properties  of  the  physical  world  can  constrain  its  solution,  and  different  algorithms, 
or  procedures,  that  can  be  used  to  carry  out  the  solution.  Computer  implementations  of 
possible  strategies  can  show  how  the  overall  behavior  of  the  system  depends  on  the  choice  of 
constraints  and  algorithms,  and  critical  experimental  tests  can  be  devised  to  determine  what 
choices  are  made  in  biological  systems.  Thus,  a  fruitful  interaction  between  computational 
and  experimental  approaches  can  contribute  to  our  understanding  of  biological  vision. 

Low  level  and  high  level  vision 

Biological  vision  begins  with  measurements  of  the  amount  of  light  reflected  from  surfaces 
in  the  environment  onto  the  eye.  The  retinal  image  provided  by  the  photoreceptors  can  be 
thought  of  as  a  large  array  of  continuously  changing  numbers  that  represent  light  intensities. 
From  this  array  of  light  measurements,  the  visual  system  does  not  achieve  an  understanding 
of  what  is  in  the  scene  in  a  single  step.  A  tenet  of  computational  studies  is  that  vision 
proceeds  in  stages,  with  each  stage  producing  increasingly  more  useful  descriptions  of  the 
world.  The  process  of  vision  can  be  viewed  as  the  construction  of  a  series  of  representations 
of  visual  information  with  explicit  computation  that  transforms  one  representation  into  the 
next. 

It  is  not  yet  known  how  biological  systems  represent  visual  information,  but  compu¬ 
tational  studies  have  suggested  several  representations  that  are  useful  in  visual  processing 
(see,  for  example,  Marr,  1982;  Ballard  and  Brown,  1982;  Horn,  1986;  Fischler  and  Firschein, 
1987).  Representations  proposed  for  the  early  stages  of  vision  first  capture  information  that 
cam  be  extracted  simply  and  directly  from  the  initial  image,  such  as  the  location  and  de¬ 
scription  of  significant  intensity  changes  or  edges  in  the  image.  Subsequent  representations 
capture  the  local  geometry  or  three-dimensional  shape  of  visible  surfaces  in  the  scene,  rep¬ 
resented  as  the  orientation  or  depth  of  surfaces  at  each  location  in  the  scene.  Many  familiar 
visual  processes,  such  as  the  analysis  of  movement,  binocular  stereopsis,  surface  shad*  .g, 
texture,  and  color,  contribute  to  the  computation  of  these  early  visual  representations.  We 
refer  to  these  early  stages  of  vision  as  low  level  vision.  A  main  goal  of  low  level  visual 
processes  is  to  recover  properties  of  the  surrounding  environment,  and  the  representations 
that  they  deliver  can  be  evaluated  on  the  basis  of  their  validity,  that  is,  whether  the  results 
they  deliver  are  correct  and  accurate. 
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Later  representations  for  vision  capture  information  necessary  to  solve  complex  tasks 
such  as  navigation  through  the  environment,  manipulation  of  objects,  and  recognition.  The 
visual  processes  involved  in  accomplishing  these  tasks  are  often  referred  to  as  high  level 
vision.  Although  there  is  no  well-defined  boundary  between  low  and  high  level  vision,  the 
distinction  is  often  a  useful  one.  The  main  difference  between  low  and  high  level  visual  pro¬ 
cesses  is  in  the  kind  of  knowledge  they  use.  Low  level  vision  relies  on  assumptions  regarding 
the  general  physical  properties  of  objects,  such  as  continuity  and  rigidity.  High  level  tasks, 
such  as  recognition,  often  use  some  knowledge  we  have  acquired  about  specific  objects,  such 
as  their  shape  and  perhaps  the  transformations  that  they  may  undergo.  Additional  dis¬ 
tinctions  between  low  and  high  level  vision,  and  the  description  of  an  intermediate  domain 
between  them,  are  discussed  in  section  3  of  this  chapter. 

Scope  of  the  chapter 

The  next  section  focuses  on  three  problem  areas  in  which  there  has  been  substantial 
computational  work,  as  well  as  important  interactions  between  computational  work  and 
experimental  studies  from  psychology  and  the  neurosciences.  In  particular,  we  consider  tfie 
problems  of  edge  detection,  binocular  stereo  and  the  analysis  of  visual  motion.  Overall,  re¬ 
search  in  computational  vision  spans  a  wide  range  of  problems  that  also  include  the  analysis 
of  texture,  color,  and  the  recovery  of  3-D  shape  from  image  contours  and  smooth  shading. 
Most  early  work  explored  these  problems  in  isolation,  while  recent  work  has  emphasized 
the  integration  of  multiple  visual  processes.  Reviews  of  computational  work  in  these  areas 
can  be  found  in  a  number  of  recent  books  (for  example,  Brady,  1981;  Ballard  and  Brown, 
1982;  Marr,  1982;  Beck,  Hope  and  Rosenfeld,  1983;  Ullman  and  Richards,  1984;  Levine, 
1985;  Horn,  1986;  Pentland,  1986;  Brown,  1987;  Fischler  and  Firschein,  1987;  Richards  and 
Ullman,  1987).  Section  3  explores  problems  in  intermediate  and  high  level  vision,  and  ad¬ 
dresses  two  major  problems.  The  first  is  the  extraction  of  shape  properties  of  figures  and 
spatial  relations  among  items  in  the  scene.  The  second  is  visual  object  recognition. 

2.  LOW  LEVEL  VISION 

2.1  Edge  Detection 

When  the  image  is  first  captured  by  the  retinal  photoreceptors  in  the  eye,  it  consists 
of  some  120  million  individual  pointwise  measurements  of  lignt  intensity.  A  major  problem 
that  arises  is  how  to  transform  this  large  and  unstructured  set  of  measurements  into  a  more 
useful  representation,  that  is,  a  representation  that  is  more  concise  and  more  convenient  for 
subsequent  processing  stages. 

In  the  fields  of  computer  vision  and  image  processing,  the  dominant  suggestion  has  been 
to  begin  the  analysis  of  the  incoming  image  by  producing  from  the  intensity  array  an  edge 
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representation,  using  a  process  called  edge  detection .  Edges  in  the  image  are  locations  where 
the  light  intensity  changes  significantly  from  one  level  to  a  different  one.  The  main  rationale 
for  using  an  edge  representation  as  the  primary  representation  is  that  intensity  edges  are 
usually  physically  significant:  they  correspond  to  object  boundaries  and  to  discontinuities 
in  surface  properties,  such  as  spatial  orientation  or  reflectance.  The  important  role  of  edges 
is  also  supported  by  the  fact  that  for  humans,  a  line  sketch  of  an  image  often  conveys  most 
of  the  essential  information,  although  from  the  point  of  view  of  the  underlying  intensity 
distributions,  the  line  sketch  and  the  image  are  radically  different.  Similarly,  in  experiments 
involving  pattern  recognition  by  animals  (such  as  the  rat,  octopus,  and  goldfish),  it  has  been 
noted  that  figures  are  usually  treated  as  equivalent  to  a  sketch  of  their  outlines  (Sutherland, 
1968).  Early  perceptual  studies  by  Cornsweet  and  others  (see,  for  example,  Cornsweet, 
1970)  also  show  that  our  perception  of  an  image  does  not  directly  mimic  the  initial  intensities 
registered  by  the  eye,  but  is  strongly  influenced  by  the  presence  of  sharp  intensity  changes, 
or  edges.  An  example  of  an  edge  representation  obtained  from  a  natural  image,  using  am 
edge  detection  method  for  computer  vision  systems  proposed  by  Canny  (1986)  is  shown  in 
Figure  1. 

It  should  be  mentioned  that  the  operation  of  edge  detection  is  not  the  only  proposal 
that  has  been  made  concerning  the  initial  representation  of  the  incoming  image.  A  popular 
alternative  has  been  a  local  Fourier  (or  Gabor)  decomposition  of  the  image.1  These  alter¬ 
native  suggestions  have  not  yet  played  as  significant  a  role  in  computational  vision,  and  will 
not  be  discussed  further  here. 

In  the  discussion  below  we  will  consider  primarily  one  family  of  edge  detection  schemes 
that  are  based  on  detecting  maxima  in  the  rate  of  change  (defined  as  maxima,  or  peaks,  in 
the  first  derivative  or  zero-crossings  in  the  second)  of  the  image  intensities.  The  reasons 
for  concentrating  on  this  group  are  first,  that  such  schemes  have  produced  high  quality 
results  that  have  been  used  successfully  in  later  stages  of  visual  processing  in  computer 
vision  systems,  and  second,  interesting  possible  connections  have  been  proposed  between 
this  scheme  and  edge  detection  by  the  human  visual  system.  For  reviews  of  various  edge 
detection  techniques,  see,  for  example,  Davis  (1975),  Fram  and  Deutsch  (1975),  Pratt  (1978), 
Ballard  and  Brown  (1982),  Torre  and  Poggio  (1986),  Fischler  and  Firschein  (1987),  and 
Hildreth  (1987). 


JThe  description  of  the  image  using  Fourier  components  is  based  on  the  mathematical  notion 
that  any  periodic  function  can  be  expressed  as  the  sum  of  more  elementary  functions  (sines  and 
cosines),  each  multiplied  by  an  appropriate  scaling  factor  (see,  for  example,  Brace  well,  1978), 
The  set  of  scaling  factors  (called  the  Fourier  coefficients)  can  serve  to  characterize  the  function 
uniquely.  The  Fourier  approach  to  early  vision  suggests  that  the  incoming  image  is  divided  into 
regions,  and  for  each  region  the  visual  system  constructs  a  representation  that  resembles  a  Fourier 
decomposition. 
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Differentiation  and  smoothing 

To  produce  an  edge  representation  of  the  image,  one  needs  a  more  precise  definition 
of  what  is  meant  by  “significant”  intensity  changes.  One  definition  that  has  been  proposed 
is  to  identify  edges  as  locations  in  the  image  where  the  (directional)  derivative  of  intensity 
reaches  a  local  maximum  in  its  absolute  value.  It  can  be  shown  that,  under  quite  general 
conditions,  when  a  signal  passes  through  an  optical  system,  step  edges  in  the  incoming  inten¬ 
sity  distribution  (before  the  optics)  will  give  rise  to  maxima  in  the  first  directional  derivative 
of  image  intensity  (after  the  optics).  Locating  derivative  maxima  therefore  appears  to  be  a 
reasonable  step  toward  identifying  sharp  edges. 

The  differentiation  of  a  signal,  however,  raises  certain  technical  difficulties,  because 
this  operation  enhances  considerably  the  high  frequency  components  of  the  signal.  Because 
images  often  contain  high  frequency  noise,  the  noise  is  enhanced  by  differentiation.2  To 
overcome  this  problem,  the  differentiation  of  a  signal  is  usually  preceded  by  a  smoothing 
operation  that  removes  the  noisy  high  frequency  components.  It  has  been  shown  that 
optimal  results  can  be  obtained  by  using  a  smoothing  function  that  has  a  Gaussian,  or 
close  to  a  Gaussian  shape  (Shanmugam,  Dickey  and  Green,  1979;  Marr  and  Hildreth,  1980; 
Koenderink,  1984;  Babaud  et  al,  1986;  Canny,  1986;  Torre  and  Poggio,  1986).  Thus, 
one  approach  to  edge  detection  is  to  perform  the  following  steps:  (1)  smooth  the  signal 
using  a  Gaussian  filter,  (2)  differentiate  the  signal,  and  (3)  locate  maxima  in  its  directional 
derivatives.3 

Mathematically,  these  stages  can  be  summarized  (for  a  one-dimensional  signal)  as  lo¬ 
cating  maxima  in  j^{G  *  /),  where  G  is  a  Gaussian  function  and  *  denotes  the  filtering, 
ox  convolution  operation.  Interestingly,  the  same  operation  can  be  obtained  by  performing 
(j^G)  *  /.  This  means  that  the  two  successive  operations,  smoothing  and  differentiation, 
can  be  combined.  The  image  is  filtered  through  a  new  function,  the  first  derivative  of  a 
Gaussian.  Maxima  in  the  first  derivative  of  the  smoothed  image  can  be  detected  equiva¬ 
lently  by  finding  locations  where  the  second  derivative  of  the  smoothed  image  crosses  zero. 
If  we  again  use  Gaussian  smoothing,  then  mathematically,  this  second  scheme  corresponds 
to  finding  zero-crossings  in  the  signal  (jp-G)  *  /.  To  summarize,  edges  can  be  found  by 
(1)  passing  the  image  through  a  filter  whose  shape  is  the  first  or  second  derivative  of  a 
Gaussian,  and  (2)  locating  features  such  as  maxima  or  zero-crossings  in  the  result  of  this 
filtering  stage.  For  the  case  of  one  dimension,  the  use  of  maxima  and  zero-crossings  in  the 
derivatives  of  Gaussian-filtered  images  is  illustrated  in  Figure  2.  The  figure  shows  a  one¬ 
dimensional  intensi  ;  profile  obtained  by  taking  a  slice  through  a  natural  image,  the  same 

2Because  the  derivative  of  a  sine  function  sin(wx)  is  u/cos(w*),  it  follows  that  the  effect  of  differ¬ 
entiation  is  to  enhance  periodic  components  with  high  spatial  frequency  (large  u>). 

3  One  can  detect  maxima  in  the  absolute  value  of  this  derivative,  or  equivalently,  maxima  or 
minima  (positive  or  negative  peafcj)  in  the  derivative  itself. 


Figure  2.  Detecting  intensity  changes,  (a^  One-dimensional  intensity  profile  obtained  by  measur¬ 
ing  intensities  along  a  horizontal  slice  of  a  natural  image,  (b)  The  result  of  smoothing  the  profile  in 
(a),  (c)  The  result  of  additional  smoothing  of  (a),  (d)  and  (e)  The  first  and  second  derivatives,  re¬ 
spectively,  of  the  smoothed  profile  shown  in  (c).  The  vertical  dashed  lines  indicate  peaks  in  the  first 
derivative  and  zero-crossings  in  the  second  derivative  that  correspond  to  two  significant  intensity 
changes  (Hildreth,  1987). 
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profile  following  G&assian  filtering,  and  how  the  edges  can  be  identified  by  the  maxima  (or 
peaks)  in  the  first  derivative  or  zero-crossings  in  the  second. 

Peeks  and  zero- crossings  in  one  and  two  dimensions 

This  basic  scheme,  presented  above  for  a  one-dimensional  signal,  has  a  number  of  pos¬ 
sible  extensions  to  the  analysis  of  two-dimensional  images.  One  alternative  uses  directional 
first  or  second  derivative  operators4  for  performing  the  two-dimensional  differentiation  (see, 
for  example,  Macleod,  1972;  Davis,  1975;  Persoon,  1976;  Rosenfeld  and  Kak,  1976;  Pratt, 
1978;  Haralick,  1980;  Binford,  1981;  Nevatia  and  Babu,  1980;  Zucker,  1987).  A  second 
strategy  is  to  evaluate,  at  each  location  in  the  image,  the  direction  of  the  intensity  gradient, 
which  is  the  direction  along  which  the  intensity  changes  most  rapidly.  A  one-dimensional 
analysis  of  intensity  changes  is  then  applied  locally  in  this  direction  (for  example,  Canny, 
1986).  In  a  third  alternative,  the  image  is  first  passed  through  a  single,  non-directional  fil¬ 
ter,  whose  shape  is  defined  by  V2G,  the  Laplacian  of  a  two-dimensional  Gaussian  function 
(the  Laplacian  is  the  sum  of  the  two  directional  derivatives  in  the  horizontal  and  vertical 
directions),  and  then  followed  by  the  detection  of  the  zero-crossings  in  the  result  of  this 
filtering  stage.  The  V2G  operator  is  shown  in  Figure  3.  It  is  interesting  to  note  in  this 
regard  that  under  quite  general  conditions,  the  zero-crpssings  representation  produced  in 
this  manner  gives  a  complete  representation  of  the  image;  that  is,  the  original  image  can  be 
reconstructed  (up  to  a  single  scaling  factor)  from  the  zero-crossing  contours  alone  (Curtis 
and  Oppenheim,  1987). 

Multiple  scales 

Intensity  edges  can  occur  in  an  image  at  a  variety  of  scales.  Some  edges  are  sharp 
and  close  together,  others  are  gradual  and  well  separated.  It  proved  difficult  to  capture 
all  of  the  different  types  of  significant  intensity  changes  using  only  a  single  fixed  operator. 
An  approach  that  has  evolved  in  computational  vision  in  response  to  this  problem  is  to 
analyze  the  image  at  a  number  of  different  scales  (for  example,  Rosenfeld  and  Thurston, 
1971;  Marr  and  Hildreth,  1980;  Mayhew  and  Frisby,  1981;  Witkin,  1983;  Canny,  1986).  In 
a  coarse  scale  edge  representation,  only  a  limited  number  of  significant,  relatively  isolated 
edges  would  be  revealed,  whereas  a  fine  scale  (high  resolution)  analysis  would  yield  a  denser 
representation  of  the  edges.  In  terms  of  the  edge  detection  scheme  outlined  above,  the 
resolution  or  scale  of  the  analysis  is  determined  by  the  size  of  the  operator  (such  as  the 
value  <7  of  the  space  constant  of  the  underlying  ^aussian  in  the  V2G  operator).  Multiple 
scale  analysis  is  therefore  achieved  by  the  simultaneous  use  of  a  number  of  operators  of 
different  sizes,  as  shown  in  Figure  4. 

4  These  operators  compute  the  derivative  of  the  smoothed  image  along  a  particular  two-dimensional 
direction. 


a. 


Figure  5.  The  V2G  operator,  (a)  Three-dimensional  plot  of  the  shape  of  the  two-dimensional 
operator,  (b)  A  one-dimensional  cross-section  through  the  center  of  the  operator. 


This  general  idea  of  multiple  scale  analysis  is  in  agreement  with  the  large  body  of 
psychophysical  data  regarding  multiple  channels  in  the  visual  system.  Experiments  using 
techniques  such  as  detection,  adaptation,  and  discrimination,  suggest  that  at  each  location  in 
the  visual  field,  a  number  of  mechanisms  analyze  the  image  at  a  range  of  different  scales  (for 
example,  Campbell  and  Robson,  1986;  Cowan,  1977;  Graham,  1977;  Watson  and  Nachmias, 
1977;  Wilson  and  Bergen,  1979).  These  mechanisms  appear  to  be  sensitive  to  different 
ranges  of  spatial  and  temporal  frequency. 


Biological  significance 


One  question  regarding  edge  representations  based  on  features  such  as  peaks  and  zero- 
crossings  is  their  biological  relevance.  It  is  known  that  the  appropriate  kind  of  filtering  takes 
place  at  the  level  of  the  retina  and  the  lateral  geniculate  nucleus.  That  is,  at  this  level,  the 
image  can  be  thought  of  as  being  passed  through  filters,  or  receptive  fields,  whose  shape  can 
be  closely  approximated  by  the  V2G  shape  presented  above.  It  is  not  entirely  clear,  however, 
what  processing  takes  place  beyond  this  first  stage,  at  the  level  of  the  primary  visual  cortex. 
It  is,  in  fact,  an  instructive  point  that  in  spite  of  the  large  body  of  Knowledge  regarding 
the  anatomy  and  physiology  of  the  primary  visual  cortex,  relatively  little  is  known  about 
the  functions  it  performs.  Many  cortical  studies  are  in  general  agreement  with  the  notion 
that  this  area  may  be  involved,  among  other  functions,  with  the  task  of  edge  detection. 
In  particular,  the  so-called  cortical  simple  cells  have  often  been  described  as  edge  and 
bar  detectors.  This  view  is  not  universally  accepted,  however  (see,  for  example,  DeValois, 
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Albrecht  and  Thorell,  1982),  and  the  exact  function  of  these  and  other  cortical  units  remains 
unclear. 

A  possibility  raised  by  the  preceding  discussion  is  that  one  of  the  functions  of  this  corti¬ 
cal  area  may  be  to  construct  a  number  of  zero-crossing  maps  at  different  scales,  and  then  to 
combine  them  to  construct  an  edge  representation  of  the  incoming  image.  This  point  of  view 
regarding  the  possible  function  of  the  primary  visual  cortex  gives  rise  to  specific  predictions 
that  can  be  tested  by  physiological  techniques.  Some  of  these  predictions  have,  in  fact,  been 
tested  (Richter  and  Ullman,  1986),  and  the  results  appear  generally  to  be  compatible  with 
a  zero-crossing  based  view.  Recent  psychophysical  studies  (for  example,  Watt  and  Mor¬ 
gan,  1983,  1984)  provide  some  support  for  the  use  of  zero-crossings  in  human  vision,  but 
show  that  additional  primitive  features,  such  as  stationary  points  in  an  approximation  to 
the  second  derivative  of  the  image  intensities,  are  essential  to  account  for  the  ability  of  the 
human  visual  system  to  detect  and  localize  the  positions  of  significant  intensity  changes  in 
the  image.  The  experiments  performed  so  far  are  insufficient  to  determine  conclusively  the 
role  of  edges  in  early  vision,  but  they  serve  to  demonstrate  that  connections  are  starting  to 
develop  between  computational  theories  on  the  one  hand,  and  empirical  biological  studies 
of  brain  mechanisms  on  the  other. 


Figure  5.  Subjective  contours.  Examples  of  boundaries  that  can  be  perceived,  but  which  cannot 
be  detected  directly  by  an  intensity-based  edge  detection  scheme. 


A  final  comment  before  leaving  the  topic  of  edge  detection  is  that  even  if  features  such 
as  peaks  or  zero-crossings  in  the  result  of  smoothing  and  differentiating  the  image  are 
being  detected,  it  should  be  realized  that  they  constitute  only  one  stage  in  the  analysis  of 
boundaries.  As  mentioned  above,  edge  representations  at  different  scales  must  somehow 
be  combined.  In  addition,  there  are  many  types  of  boundaries  that  humans  are  sensitive 
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to,  but  which  cannot  be  captured  by  the  zero-crossing  scheme  or  by  other  intensity-based 
edge  detection  techniques.  Examples  of  such  boundaries  include  texture  edges,  in  which  the 
two  sides  of  the  boundary  differ  in  texture  rather  than  in  average  intensity,  and  subjective 
contours,  which  the  visual  system  fills  in  from  different  types  of  evidence  when  a  physical 
discontinuity  in  the  image  is  not  present  (see  Figure  5).  It  is  interesting  to  note  that 
recently,  cells  have  been  found  rather  early  in  the  visual  system,  in  cortical  area  V2,  that 
clearly  respond  to  subjective  contours  (von  der  Heydt  et  a/.,  1984). 

2.2  Binocular  Stereo 

The  left  and  right  eyes  obtain  two  slightly  different  views  of  the  world,  and  we  use  the 
difference  between  these  two  views  to  recover  the  distances  to  surfaces  in  the  environment. 
In  particular,  there  is  disparity  between  the  positions  of  corresponding  features  in  the  two 
retinal  images,  which  depends  both  on  where  the  objects  axe  located  in  space,  and  where 
the  eyes  are  focused.  Binocular  stereo,  the  process  that  uses  these  disparities  to  recover  3-D 
information,  is  a  critical  component  of  human  vision  —  it  is  considered  our  most  accurate 
system  for  recovering  the  distances  to  surfaces  (Westheimer  and  McKee,  1978,  1980;  PoggTo 
and  Poggio,  1984). 

Examining  the  process  of  stereo  vision  from  a  computational  perspective,  there  are 
three  main  steps  that  need  to  be  solved.  The  first  is  to  extract  elements  from  the  images, 
whose  positions  will  be  compared  in  the  two  stereo  views.  The  second  is  to  match  up  these 
elements  in  the  two  views;  that  is,  for  each  element  in  the  left  image,  an  element  is  located 
in  the  right  image  that  corresponds  to  the  same  physical  feature  in  the  scene.  Finally, 
through  a  geometric  transformation,  the  disparities  in  position  of  features  in  the  two  images 
determine  the  distances  to  objects  in  space.  It  turns  out  that  the  second  step,  referred  to  as 
the  stereo  correspondence  problem ,  is  one  of  the  most  difficult  aspects  of  the  stereo  process. 

When  we  consider  stereo  views  of  a  natural  scene,  such  as  those  shown  in  Figure  6a, 
it  might  appear  that  the  stereo  correspondence  problem  ought  to  be  straightforward.  One 
could,  for  example,  extract  small  patches  from  the  left  image,  and  look  for  the  most  similar 
patch  of  image  intensities  in  the  right  image.  This  strategy  is  referred  to  as  grey-level 
correlation,  and  it  formed  the  basis  for  early  stereo  systems  developed  in  computer  vision 
(for  review,  see  Barnard  and  Fischler,  1982),  as  well  as  early  models  of  human  stereo  vision 
(for  example,  Sperling,  1970).  Techniques  based  on  matching  the  image  intensities  did  not 
work  well  in  practice,  primarily  for  two  reasons.  First,  the  image  intensities  can  change 
significantly  between  the  two  views  —  they  undergo  photometric  and  geometric  distortions. 
An  example  of  a  photometric  change  between  the  two  images  is  a  bright  highlight  that 
appears  in  one  image  but  not  the  other,  because  of  the  way  the  surface  catches  the  light. 
An  example  of  a  geometric  change  is  a  rotation  of  one  image  relative  to  the  other.  Such 
rotations  often  occur  when  the  angle  of  gaze  deviates  significantly  from  straight  ahead.  These 
distortions  can  lead  to  significant  errors  in  stereo  methods  based  on  grey-level  correlation. 


b. 

Figure  8.  Stereo  analysis,  (a)  Aerial  stereo  photographs  of  a  section  of  the  campus  at  the 
University  of  British  Columbia,  (b)  The  disparities  computed  by  the  Marr-Poggio-Grimson  stereo 
correspondence  algorithm  (Grimson,  1985),  for  the  images  shown  in  (a).  Brightness  encodes  depth, 
with  brighter  contours  being  closer  to  the  viewer. 


There  is  a  second  problem,  however,  that  is  more  severe  —  there  is  tremendous  ambi¬ 
guity  in  matching  features  between  the  two  images.  Given  some  patch  in  one  image,  there 
are  often  many  patches  in  the  other  that  are  quite  similar.  This  ambiguity  is  exacerbated 
by  the  wide  range  of  depths  contained  in  a  typical  scene,  from  which  it  follows  that  there 
will  be  a  wide  range  of  possible  disparities  between  features  in  the  two  images.  To  solve  the 
stereo  correspondence  problem,  the  matching  process  needs  to  be  constrained,  in  order  to 
derive  a  unique  matching  of  features  between  the  two  stereo  views. 

The  ambiguity  problem  is  best  illustrated  with  stereo  images  that  consist  of  random 
dots,  as  shown  in  Figure  7  (Julesz,  1971).  The  left  image  is  a  random  array  of  black  and 
white  dots.  To  construct  the  right  image,  patterns  of  dots  from  the  left  image  are  shifted  by 
different  amounts  to  the  left  or  right  in  the  right  image,  creating  disparity  in  their  positions 
in  the  two  stereo  views.  In  the  case  of  Figure  7,  a  central  patch  of  dots  in  the  left  image  is 
shifted  to  the  right  in  the  right  image.  If  the  resulting  patterns  are  observed  in  such  a  way 
that  the  left  eye  views  only  the  left  pattern  and  the  right  eye  views  the  right,  a  perception  of 
3-D  surfaces  at  different  depths  results  from  the  relative  placement  of  the  dots.  The  ability 
to  ‘fuse’  random-dot  stereograms  to  form  a  perception  of  depth  illustrates  two  important 
aspects  of  stereo  vision.  First,  the  features  that  are  matched  between  the  two  images  must 
be  simple  —  stereo  fusion  does  not  require  that  we  recognize  objects  in  the  image,  and  match 
entire  objects  between  the  two  stereo  views.  Second,  it  illustrates  the  inherent  ambiguity 
faced  by  the  stereo  correspondence  process.  In  principle,  any  dot  in  the  left  image  could 
match  any  dot  in  the  right  image  of  the  same  color. 


Figure  7.  A  random-dot  stereogram  with  disparities  that  reveal  two  depth  planes  stacked  on  top 
of  one  another. 
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To  solve  the  stereo  correspondence  problem  uniquely,  it  is  necessary  to  constrain  the  match¬ 
ing  process  in  some  way.  Computational  models  of  the  stereo  correspondence  process  use 
a  variety  of  such  constraints.  First,  an  object  in  space  occupies  only  a  single  location  at 
one  time,  which  implies  that  each  feature  in  the  left  image  matches  only  a  single  feature  in 
the  right  image.  This  is  referred  to  as  the  uniqueness  constraint.  Second,  because  of  the 
nature  of  stereo  projection,  features  located  along  a  particular  line  in  one  image  all  match 
with  features  along  a  line  in  the  other  image.  If  the  eyes  are  focused  at  a  distance,  then 
these  corresponding  lines  are  roughly  horizontal  and  at  the  same  vertical  height  in  the  two 
images.  If,  however,  the  eyes  are  focused  at  a  nearby  object,  or  the  angle  of  gaze  deviates 
significantly  from  being  straight  ahead,  then  these  lines  have  different  orientations  in  the 
two  images.  If  the  positions  of  the  eyes  in  their  orbit  are  known,  then  it  is  possible  to  deter¬ 
mine  where  the  corresponding  lines  lie  in  the  two  images.  This  facilitates  stereo  matching, 
because  now,  given  a  feature  in  one  image,  it  is  only  necessary  to  search  along  a  single  line 
in  the  other  image  for  a  possible  match.  Virtually  all  models  of  stereo  vision  assume  that 
the  viewing  geometry  is  known.  This  is  referred  to  as  the  epipolar  constraint. 

Finally,  all  models  of  stereo  correspondence  assume  some  form  of  the  constraint  that  the 
distance  to  surfaces  in  the  scene  tends  to  vary  slowly  across  the  image  (for  example,  Marr 
and  Poggio,  1976,  1979;  Mayhew  and  Frisby,  1981;  Baker,  1982;  Barnard  and  Fischler,  1982; 
Medioni  and  Nevatia,  1985;  Pollard,  Mayhew  and  Frisby,  1985;  Prazdny,  1986;  Barnard, 
1987;  Hoff  and  Ahuja,  1987).  Surfaces  are  usually  smooth,  so  that  the  depths  of  nearby 
points  on  the  surface  are  usually  similar.  Computational  models  have  used  this  continuity 
constraint  in  different  ways.  Some  try  to  match  features  in  the  two  views  in  such  a  way 
that  the  variation  in  depth  of  the  resulting  surfaces  is  minimized  (for  example,  Prazdny, 
1986;  Barnard,  1987).  Others  impose  a  constant  limit  on  the  rate  of  change  of  disparity 
across  the  image  —  in  other  words,  they  do  not  allow  matches  to  be  made  that  would  imply 
a  surface  that  is  too  steeply  slanted  in  depth  (for  example,  Medioni  and  Nevatia,  1985; 
Pollard,  Mayhew  and  Frisby,  1985).  This  latter  approach  was  motivated  by  psychophysical 
studies  that  suggest  that  in  human  stereo  vision,  there  is  a  strict  limit  on  the  gradient  of 
stereo  disparity  (for  example,  Tyler,  1975;  Burt  and  Julesz,  1980).  Finally,  some  models 
use  a  form  of  this  constraint  based  on  the  idea  that  disparities  vary  slowly  along  connected 
edge  contours  in  the  image,  referred  to  as  figural  continuity  (Mayhew  and  Frisby,  1980, 
1981;  Baker  and  Binford,  1981;  Baker,  1982;  Medioni  and  Nevatia,  1985;  Grimson,  1985). 
These  latter  models  assume  that  features  along  continuous  contours  in  the  image  are  likely 
to  lie  along  contiguous  locations  on  a  surface  in  space,  and  should  therefore  have  similar  or 
smoothly  varying  disparities.  Mayhew  and  Frisby  (1981)  present  psychophysical  evidence 
that  a  figural  continuity  constraint  is  used  in  human  stereo  vision. 

Simple  strategies  for  computing  stereo  correspondence,  such  as  the  grey-level  correla¬ 
tion  technique  mentioned  earlier,  work  well  on  random-dot  patterns  and  highly  textured 
natural  images,  even  when  the  underlying  surfaces  are  quite  complex.  The  need  to  provide 
a  robust  solution  for  a  wide  range  of  natural  imagery  has  proven  to  be  a  difficult  problem, 
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however,  which  appears  to  require  a  more  complex  strategy.  In  search  of  more  successful 
algorithms,  many  computational  models  attempt  to  incorporate  knowledge  of  human  stereo 
vision.  These  models,  in  turn,  provide  insight  into  how  these  aspects  of  human  vision  are 
useful  in  solving  the  stereo  correspondence  problem. 

There  are  at  least  three  observations  regarding  human  stereo  vision  that  have  been 
incorporated  into  computational  models.  First,  as  we  noted  in  the  previous  section,  there  is 
much  evidence  that  the  human  visual  system  initially  extracts  features  in  the  image,  such  as 
sharp  intensity  changes  or  edges,  and  it  has  been  suggested  that  these  features  may  serve  as 
the  basic  matching  elements  for  stereo  correspondence  (for  example,  Barlow,  Blakexnore  and 
Pettigrew,  1967;  Marr  and  Poggio,  1979;  Mayhew  and  Frisby,  1981).  Second,  perceptual 
demonstrations  suggest  that  the  human  stereo  system  makes  use  of  the  different  resolutions 
of  analysis  performed  by  the  early  stages  of  vision  (for  example,  Felton,  Richards  and  Smith, 
1972;  Julesz  and  Miller,  1975).  Finally,  it  has  been  shown  that  eye  movements  are  critical  in 
fusing  two  stereo  views.  If  the  eyes  are  fixated  at  a  particular  distance,  it  is  only  possible  to 
fuse  features  in  the  left  and  right  views  whose  depths  fall  within  a  limited  range  around  the 
fixation  distance  (known  as  Panum’s  fusional  area;  see,  for  example,  Mitchell.  1966;  Fender 
and  Julesz,  1967).  Objects  outside  of  this  range  are  seen  as  double.  One  method  that  can 
be  used  to  fuse  features  over  a  wider  range  of  depths  is  to  perform  vergence  eye  movements, 
fixating  on  different  depth  planes. 

Many  stereo  models  incorporate  some  or  all  of  these  factors.  For  example,  most  stereo 
systems  match  features  between  the  two  images,  such  as  sharp  intensity  changes,  edges,  or 
other  related  features  (for  example,  Marr  and  Poggio,  1979;  Barnard  and  Thompson,  1980; 
Mayhew  and  Frisby,  1981;  Grimson,  1981,  1985;  Baker  and  Binford,  1981;  Baker,  1982; 
Barnard  and  Fischler,  1982;  Medioni  and  Nevatia,  1985;  Pollard,  Mayhew  and  Frisby,  1985; 
Prazdny,  1986;  Hoff  and  Ahuja,  1987).  Some  models  also  assume  that  the  corresponding 
features  in  the  two  eyes  are  similar  to  one  another  in  their  contrast,  orientation  in  the  image, 
sharpness,  and  so  on,  which  can  also  narrow  down  possible  matches  (for  example,  Arnold 
and  Binford,  1980;  Baker  and  Binford,  1981;  Baker,  1982;  Medioni  and  Nevatia,  1985). 
Stereo  correspondence  models  that  match  features  obtained  at  multiple  resolutions  include 
those  proposed  by  Marr  and  Poggio  (1979),  Moravec  (1980),  Grimson  (1981,  1985),  Mayhew 
and  Frisby  (1981),  and  Hoff  and  Ahuja  (1987). 

The  analysis  of  appropriate  matching  features  in  computational  models  of  stereo  cor¬ 
respondence  has  motivated  a  number  of  psychophysical  studies  aimed  at  determining  which 
features  are  used  in  human  stereo  vision.  Mayhew  and  Frisby  (1981)  provide  evidence  that 
stereo  matching  features  are  extracted  after  the  image  has  been  filtered  with  a  circularly 
symmetric  operator  (such  as  the  circular  difference-of-Gaussians  function),  and  include  the 
locations  of  peaks  in  this  filtered  image.  Bulthoff  and  Mallot  (1987)  demonstrate  that  the 
human  stereo  system  can  derive  a  3-0  impression  by  matching  smooth  variations  in  shad¬ 
ing,  but  that  the  sensation  of  depth  is  much  weaker  than  that  derived  in  the  presence  of 
sharp  edge  features  in  the  stereo  images. 
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The  stereo  correspondence  model  proposed  by  M&rr  and  Poggio  (1979)  was  presented 
as  a  possible  model  of  human  stereo  vision,  and  has  three  main  components.  First,  stereo 
matching  takes  place  independently  at  different  resolutions.  Second,  within  each  resolution, 
there  is  a  limit  to  the  amount  of  disparity  over  which  features  can  be  matched.  At  the 
coarsest  scale,  the  algorithm  searches  over  larger  distances  for  matching  features,  whereas 
at  the  finest  resolution,  there  is  a  more  severe  limit  on  the  distance  over  which  the  image  is 
searched  for  possible  matches.  Third,  the  matches  established  at  the  coarse  resolution  guide 
eye  movements  that  shift  the  entire  images  right  and  left  in  a  way  that  eventually  allows 
the  matching  of  features  at  all  resolutions.  Overall,  matches  at  the  coarse  scale  provide 
a  rough  idea  of  where  objects  are  located  in  depth,  and  this  rough  depth  map  becomes 
successively  refined  as  matches  are  established  at  finer  scales.  The  system  incorporates 
roles  for  multi-resolution  analysis  and  eye  movements,  which  are  known  to  exist  in  human 
stereo  vision.  This  algorithm  (see  also,  Grimson,  1981,  1985)  has  been  implemented  in  a 
number  of  computer  vision  systems,  and  is  one  of  the  most  thoroughly  tested  of  existing 
stereo  systems.  Figure  6b  shows  the  results  of  this  algorithm  applied  to  the  two  aerial  stereo 
photographs  shown  in  Figure  6a.  The  brightness  of  edge  contours  represents  depth,  with 
brighter  contours  closer  to  the  viewer.  Computer  models  of  this  sort  serve  as  a  rigorous  test 
for  models  of  human  stereo  processing. 

Other  models  of  stereo  correspondence  have  emerged  over  the  last  several  years  that 
also  attempt  to  mimic  aspects  of  the  human  stereo  system  (for  example,  Mayhew  and  Frisby, 
1980,  1981;  Pollard,  Mayhew  and  Frisby,  1985).  A  key  aspect  of  the  early  model  proposed 
by  Mayhew  and  Frisby  (1981),  for  which  they  provide  psychophysical  evidence,  is  that  the 
locations  of  image  features  at  multiple  spatial  resolutions  form  a  combined  “signature”  that 
is  matched  between  the  left  and  right  views.  This  differs  from  the  Marr-Poggio  (1979) 
model,  in  which  stereo  matching  takes  place  independently  at  different  spatial  scales.  The 
model  proposed  by  Pollard,  Mayhew  and  Frisby  (1985)  (the  PMF  algorithm)  combines  the 
epipolar  and  uniqueness  constraints  with  a  constraint  on  the  disparity  gradient.  That  is, 
the  rate  of  change  of  disparity  from  one  image  location  to  the  next  is  restricted  to  lie  within 
some  specified  bounds.  The  PMF  algorithm  has  two  stages.  In  the  first,  every  potential 
match  between  a  pair  of  features  in  the  left  and  right  images  is  assigned  an  initial  “strength” 
that  depends  on  the  number  of  other  potential  matches  within  a  small  neighborhood  whose 
disparities  fall  within  the  disparity  gradient  limit.  The  second  stage  is  a  relaxation  proce¬ 
dure,  in  which  matches  that  are  unambiguous  or  which  have  maximum  strengths  are  used  to 
disambiguate  other  potential  matches.  The  PMF  algorithm  was  motivated  by  psychophys¬ 
ical  studies  that  demonstrate  the  existence  of  a  disparity  gradient  limit  in  human  stereo 
vision  (for  example,  Tyler,  1975;  Burt  and  Julesz,  1980).  Variations  on  the  use  of  relaxation 
procedures  for  solving  the  stereo  correspondence  problem  have  also  been  considered  by  Marr 
and  Poggio  (1976),  Barnard  and  Thompson  (1981)  and  Prazdny  (1985). 

Given  the  importance  of  physical  constraints  in  guiding  the  nature  of  computational 
solutions  to  the  stereo  correspondence  problem,  one  would  expect  these  constraints  to  be 


18 


reflected  in  the  physiological  mechanisms  underlying  binocular  combination  in  biological 
systems.  There  has  been  considerable  interest  in  understanding  the  neural  circuitry  that 
underlies  stereo  processing  in  the  primate  visual  system  (for  review,  see  Poggio,  1984;  Poggio 
and  Poggio,  1984).  In  physiological  studies  by  Poggio  and  his  colleagues  (Poggio  and  Fischer, 
1977;  Poggio  and  Talbot,  1981;  Poggio,  1984),  populations  of  cells  were  found  in  cortical 
areas  VI  and  V2  of  the  macaque  monkey,  which  respond  to  the  disparity  in  position  of 
features  appearing  in  the  left  and  right  eyes.  Some  cells  respond  best  when  objects  appear 
in  front  of  the  point  of  fixation  of  the  two  eyes  (NEAR  cells),  other  cells  respond  best 
when  objects  appear  behind  the  fixation  point  (FAR  cells),  and  a  third  class  responds  to 
objects  at  depths  right  around  the  fixation  point.  Some  of  the  latter  class  of  cells  are 
excited  by  disparities  around  zero  (the  tuned  excitatory  cells),  while  others  are  inhibited 
by  these  disparities  (tuned  inhibitory  cells).  It  was  shown  recently  that  these  cells  respond 
to  disparity  in  complex  random-dot  stereograms  (Poggio,  1984),  so  their  output  effectively 
represents  the  result  of  solving  a  difficult  stereo  correspondence  problem. 

Several  of  the  constraints  used  in  computational  models  axe  reflected  in  the  neural 
processes  underlying  stereopsis.  For  example,  the  left  and  right  receptive  fields  of  disparity 
sensitive  cells  are  located  at  roughly  the  same  vertical  positions  in  the  two  eyes,  so  that  these 
cells  will  be  most  active  when  the  two  stereo  views  are  in  approximate  vertical  register.  (The 
human  stereo  system  can  tolerate  large  overall  vertical  disparities  between  the  left  and  right 
images,  but  it  appears  that  much  of  this  disparity  is  compensated  for  through  vertical  eye 
movements  that  bring  the  two  images  into  rough  vertical  register;  only  about  4'-7'  of  vertical 
disparity  can  be  tolerated  at  a  single  eye  position  (Nielsen  and  Poggio,  1984).)  The  fact 
that  corresponding  features  in  the  two  eyes  must  be  located  at  roughly  the  same  vertical 
location  is  analogous  to  the  assumption  in  computational  models  that  the  epipolar  geometry 
is  known. 

The  receptive  fields  of  disparity  sensitive  cells  are  also  located  at  roughly  the  same 
horizontal  positions  in  the  two  eyes.  For  many  of  the  NEAR  and  FAR  cells,  their  disparity 
sensitivity  for  bar  stimuli  extends  over  a  range  of  one  degree  or  more  on  either  side  of  zero 
disparity  (Poggio  and  Fischer,  1977;  Poggio  and  Talbot,  1981;  Poggio,  1984).  Others  have 
a  narrower  depth  response  profile.  This  overall  range  is  small  compared  to  the  actual  range 
of  disparity  that  typically  exists  for  a  natural  scene.  This  mechanism  effectively  constrains 
the  range  of  search  between  potential  corresponding  features  in  the  two  retinal  images,  for 
a  given  eye  position.  As  noted  earlier,  vergence  eye  movements  can  be  used  to  bring  objects 
over  a  wider  range  of  depths  into  roughly  corresponding  locations  on  the  two  eyes.  Based  on 
psychophysical  evidence  from  stereo-anomalous  observers,  Richards  ^  1971 )  proposed  that 
human  stereopsis  combines  the  activity  of  three  populations  of  neurons  that  are  preferentially 
sensitive  to  crossed,  near  zero,  and  uncrossed  disparities.  The  existence  of  the  three  classes  of 
NEAR,  FAR  and  TUNED  disparity  sensitive  cells  supports  this  hypothesis.  The  use  of  pools 
of  disparity  cells  with  limited  disparity  range  is  an  explicit  component  of  the  Marr-Poggio 
(1979)  stereo  model.  In  addition,  Marr  and  Poggio  proposed  that  multiple  spatial  frequency 
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channels  provide  input  to  these  disparity  sensitive  mechanisms,  with  coarser  channels  feeding 
mechanisms  that  are  sensitive  to  larger  ranges  of  disparity.  This  aspect  of  the  model  is 
supported  by  physiological  studies  of  binocular  mechanisms  in  the  cat  (Pettigrew  et  ai, 
1968;  Ferster,  1981),  which  show  a  strong  correlation  between  receptive  field  size  and  width 
of  disparity  tuning  curves. 

Physiological  evidence  does  not  yet  address  the  use  of  constraints  such  as  smoothness 
and  continuity  of  disparity  information,  or  the  use  of  a  disparity  gradient  limit.  These  con¬ 
straints  presumably  would  be  reflected  in  the  nature  of  the  interactions  between  neighboring 
disparity  sensitive  cells.  Many  of  the  computational  models  proposed  for  the  correspondence 
process  incorporate  simple,  local  interactions  between  nearby  disparity  mechanisms,  which 
could  easily  be  implemented  in  parallel  over  the  visual  field  (for  example,  Marr  and  Poggio, 
1976;  Barnard  and  Thompson,  1981;  Prazdny,  1985;  Pollard,  Mayhew  and  Frisby,  1985; 
Poggio  and  Drumheller,  1986).  Recent  efforts  have  begun  to  address  the  details  of  the 
neural  circuitry  that  might  underlie  the  parallel  computation  of  binocular  disparity.  Koch 
and  Poggio  (1987)  proposed  possible  neural  implementations  of  stereo  disparity  detectors, 
which  yield  specific  predictions  for  their  behavior  that  can  be  tested  through  physiological 
experiments. 

Finally,  we  mention  some  other  important  aspects  of  stereo  vision  that  have  been  ad¬ 
dressed  in  computational  and  experimental  studies,  but  which  are  not  covered  in  this  dis¬ 
cussion.  One  problem  is  the  detection  of  discontinuities  in  stereo  disparity,  which  often 
occur  at  the  locations  of  object  boundaries.  Such  discontinuities  violate  the  assumptions  of 
smoothness  and  continuity  of  disparity,  and  may  require  specialized  mechanisms  to  detect. 
A  second  problem  is  the  “filling-in”  of  continuous  surfaces  between  image  locations  where 
stereo  disparities  are  initially  computed.  Human  observers  perceive  continuous  surfaces 
when  observing  sparse  random-dot  stereograms.  Computational  models  that  initially  de¬ 
rive  disparity  measurements  only  at  the  locations  of  image  features,  such  as  edges,  also  raise 
the  problem  of  interpolating  a  dense  disparity  map  between  sparse  disparity  information. 
A  third  problem  is  the  transformation  between  stereo  disparity  and  depth,  which  is  likely 
to  require  knowledge  about  the  current  viewing  geometry,  such  as  the  position  of  the  eyes. 

2.3  The  Analysis  of  Visual  Motion 

The  measurement  and  use  of  visual  motion  is  one  of  the  most  fundamental  abilities  of 
biological  vision  systems,  serving  many  essential  functions.  For  example,  a  sudden  move¬ 
ment  in  the  scene  might  indicate  an  approaching  predator  or  desirable  prey.  The  rap'  ' 
expansion  of  features  in  the  visual  field  can  signal  an  object  about  to  collide  with  the  ob¬ 
server.  Discontinuities  in  motion  often  occur  at  object  boundaries  and  can  be  used  to  carve 
up  the  scene  into  distinct  objects.  Motion  signals  provide  input  to  centers  controlling  eye 
movements,  allowing  objects  of  interest  to  be  tracked  through  the  scene.  Relative  movement 
can  be  used  to  infer  the  3-D  structure  and  motion  of  object  surfaces,  and  the  movement 
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of  the  observer  relative  to  the  scene,  allowing  biological  systems  to  navigate  quickly  and 
efficiently  through  the  environment. 

The  pattern  of  movement  in  the  changing  image  is  not  given  to  the  visual  system 
directly,  but  must  be  inferred  from  the  changing  intensities  that  reach  the  eye.  The  3-D 
shape  of  object  surfaces,  the  locations  of  object  boundaries  and  the  movement  of  the  observer 
relative  to  the  scene  can  in  turn  be  inferred  from  the  pattern  of  image  motion.  Typically, 
the  overall  analysis  of  motion  is  divided  into  these  two  stages:  first,  the  measurement  of 
movement  in  the  changing  two-dimensional  (2-D)  image,  and  second,  the  use  of  motion 
measurements,  for  example  to  recover  the  3-D  layout  of  the  environment.  The  following 
two  sections  discuss  computational  studies  that  address  these  two  aspects  of  motion  analysis. 

The  measurement  of  motion 

The  measurement  of  movement  may  itself  be  divided  into  multiple  stages  and  per¬ 
formed  in  different  ways  in  biological  systems.  There  is  evidence  that  in  the  human  visual 
system,  motion  may  be  measured  by  at  least  two  different  processes,  termed  short-range  and 
long-range  processes  (Braddick,  1974,  1980).  The  short-range  process  analyzes  continuous 
motion,  or  motion  presented  discretely,  but  with  small  spatial  and  temporal  displacements 
from  one  moment  to  the  next.  The  long-range  process  may  then  analyze  motion  over  larger 
spatial  and  temporal  displacements,  as  in  apparent  motion.  There  is  evidence  that  these 
two  processes  interact  at  some  stage  (for  example,  Green,  1983),  but  initially  they  may  be 
somewhat  independent.  This  section  addresses  one  fundamental  aspect  of  the  short-range 
measurement  of  motion,  the  solution  of  the  aperture  problem. 

Consider  the  computation  of  a  projected  2-D  velocity  field,  which  assigns  to  each 
feature  in  the  image,  a  direction  and  speed  of  velocity.  Figure  8a  shows  an  example  of  the 
projected  velocity  field  for  a  3-D  wireframe  object  that  is  rotating  around  a  central  vertical 
axis.  In  both  biological  and  computer  vision  systems,  the  first  mechanisms  for  measuring 
this  motion  examine  only  a  limited  region  of  the  visual  image,  and  as  a  consequence,  often 
provide  only  partial  information  about  the  2-D  velocity  field,  due  to  the  aperture  problem 
(Wallach,  1976;  Fennema  and  Thompson,  1979;  Burt  and  Sperling,  1981;  Horn  and  Schunck, 
1981;  Marr  and  Ullman,  1981;  Adelson  and  Movshon,  1982).  This  problem  arises  when  an 
oriented  pattern  in  the  image,  such  as  an  edge  contour,  extends  beyond  the  region  of  the 
image  analyzed  by  the  initial  motion  detection  mechanisms.  The  information  provided  by 
such  mechanisms  is  illustrated  in  Figure  3b.  In  this  case,  an  extended  edge  E  moves  across 
the  image  and  its  movement  is  observed  through  a  window  defined  by  the  circular  aperture 
A.  Through  this  window,  it  is  only  possible  to  observe  the  movement  of  the  edge  in  the 
direction  perpendicular  to  its  orientation.  The  component  of  motion  along  the  orientation 
of  the  edge  is  invisible  through  this  limited  aperture.  Thus  it  is  not  possible  to  distinguish 
between  motions  in  the  directions  b,  c  and  d.  This  problem  is  inherent  in  any  motion 
detection  operation  that  examines  only  a  limited  area  of  the  image,  and  arises  when  an 
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oriented  pattern  extends  beyond  the  region  of  the  image  analyzed  by  the  initial  motion 
detection  mechanisms. 

As  a  consequence  of  the  aperture  problem,  the  measurement  of  motion  in  the  changing 
image  requires  two  stages  of  analysis:  the  first  measures  components  of  motion  in  the 
direction  perpendicular  to  image  features;  the  second  combines  these  components  of  motion 
to  compute  the  full  2-D  pattern  of  movement  in  the  image.  In  Figure  8c,  a  circle  undergoes 
pure  translation  to  the  right.  The  arrows  along  the  contour  represent  the  perpendicular 
components  of  velocity  that  can  be  measured  directly  from  the  changing  image.  These 
component  measurements  each  provide  some  constraint  on  the  possible  motion  of  the  circle, 
as  illustrated  in  Figure  8d.  The  bold  vector  v  represents  the  local  perpendicular  component 
of  motion  at  a  particular  location  in  the  image.  The  possible  true  motions  at  that  location  are 
given  by  the  set  of  velocity  vectors  whose  endpoint  lies  along  the  line  l  oriented  perpendicular 
to  the  vector  v.  Examples  of  possible  true  velocities  are  indicated  by  the  dotted  vectors. 
The  movement  of  image  features  such  as  corners  or  small  spots  can  be  measured  directly. 
In  general,  however,  the  first  measurements  of  movement  provide  only  partial  information 
about  the  true  movement  of  features  in  the  image,  and  must  be  combined  to  compute  the 
full  pattern  of  2-D  motion. 

The  measurement  of  movement  is  difficult  because  in  theory,  there  are  infinitely  many 
patterns  of  motion  that  are  consistent  with  a  given  changing  image.  For  example,  in  Figure 
8e,  the  contour  C  rotates,  translates  and  deforms  to  yield  the  contour  C'  at  some  later  time. 
The  true  motion  of  the  point  p  is  ambiguous.  Additional  constraint  is  required  to  identify  a 
unique  solution.  In  general,  it  may  not  be  possible  to  recover  the  2-D  projection  of  the  true 
3-D  field  of  motions  of  points  in  space,  from  the  changing  image  intensities.  Factors  such 
as  changing  illumination,  specularities  and  shadows  can  generate  patterns  of  optical  flow  in 
the  image  that  do  not  correspond  to  the  real  movement  of  surface  features.  The  additional 
constraint  used  to  measure  image  motion  can  yield  at  best  a  solution  that  is  most  plausible 
from  a  physical  standpoint. 

Many  physical  assumptions  could  provide  the  additional  constraint  needed  to  compute  a 
unique  pattern  of  image  motion.  One  possibility  is  the  assumption  of  pure  translation.  That 
is,  it  is  assumed  that  velocity  is  constant  over  small  areas  of  the  image.  This  assumption  has 
been  used  both  in  computer  vision  studies  and  in  biological  models  of  motion  measurement 
(for  example,  Lappin  and  Bell,  1976;  Pantle  and  Picciano,  1976;  Fennema  and  Thompson, 
1979;  Anstis,  1980;  Marr  and  Ullman,  1981;  Thompson  and  Barnard,  1981;  Adelson  and 
Movshon,  1982).  Methods  that  assume  pure  translation  may  be  used  to  detect  sudden 
movements  or  to  track  objects  across  the  visual  field.  These  tasks  may  require  only  a  rough 
estimate  of  the  overall  translation  of  objects  across  the  image.  Tasks  such  as  the  recovery  of 
3-D  structure  from  motion  require  a  more  detailed  measurement  of  relative  motion  in  the 
image,  and  require  the  use  of  a  more  general  physical  assumption. 
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Figure  8.  The  aperture  problem,  (a)  On  the  left  are  three  views  of  a  wireframe  object  undergoing 
rotation  around  a  central  vertical  axis.  On  the  right,  the  arrows  along  the  contours  of  the  object 
represent  the  instantaneous  velocities  of  points  along  the  object’s  contours  at  one  position  in  the 
object’s  trajectory,  (b)  An  operation  that  views  the  moving  edge  E  through  the  local  aperture  A 
can  compute  only  the  component  of  motion  c  in  the  direction  perpendicular  to  the  orientation  of  the 
Jge.  (c)  The  circle  undergoes  pure  translation  to  the  right;  the  arrows  represent  the  perpendicular 
components  of  velocity  that  can  be  measured  from  the  changing  image,  (d)  The  curve  C  rotates, 
translates  and  deforms  over  time  to  yield  the  curve  C\  The  velocity  of  the  point  p  is  ambiguous, 
(e)  The  vector  v  represents  the  perpendicular  component  of  velocity  at  some  location  in  the  image. 
The  true  velocity  at  that  location  must  project  to  the  line  /  perpendicular  to  v;  examples  are  shown 
with  dotted  arrows. 
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Other  computational  studies  have  assumed  that  velocity  varies  smoothly  across  the 
image  (Horn  and  Schunck,  1981;  Hildreth,  1984;  Nagel,  1984;  Nagel  and  Enkelmann,  1984; 
Anandan  and  Weiss,  1985).  The  assumption  rests  on  the  principle  that  physical  surfaces 
are  generally  smooth;  that  is,  variations  in  the  structure  of  a  surface  are  usually  small, 
compared  with  the  distance  of  the  surface  from  the  viewer.  When  surfaces  move,  nearby 
points  tend  to  move  with  similar  velocities.  There  exist  discontinuities  in  movement  at 
object  boundaries,  but  most  of  the  image  is  the  projection  of  relatively  smooth  surfaces. 
Thus,  it  is  natural  to  assume  that  image  velocities  vary  smoothly  over  most  of  the  visual 
field.  A  unique  pattern  of  movement  can  be  obtained  by  computing  a  velocity  field  that  is 
consistent  with  the  changing  image  and  has  the  least  amount  of  variation  possible. 

The  use  of  the  smoothness  constraint  allows  general  motion  to  be  analyzed.  Surfaces 
can  be  rigid  or  nonrigid,  undergoing  any  movement  in  space.  This  assumption  also  can  be 
embodied  in  the  motion  measurement  computation  in  a  way  that  guarantees  a  unique  so¬ 
lution,  which  is  often  physically  correct  (Hildreth,  1984;  UUman  and  Yuille,  1987).  Finally, 
the  velocity  field  of  least  variation  can  be  computed  straightforwardly,  using  standard  com¬ 
puter  algorithms  (Horn  and  Schunck,  1981;  Hildreth,  1984;  Nagel  and  Enkelmann,  1984), 
as  well  as  simple  analog  resistive  networks  that  resemble  the  properties  of  neural  networks 
(Poggio,  Torre  and  Koch,  1935;  Poggio  and  Koch,  1985).  Perceptual  studies  reveal  a  simi¬ 
larity  between  the  behavior  of  a  motion  measurement  computation  based  on  the  smoothness 
assumption  and  the  human  perception  of  motion  (Hildreth,  1984;  Nakayama  and  Silverman, 
1987a, b). 

The  aperture  problem  in  motion  measurement  can  be  examined  from  a  physiological 
perspective.  Early  movement  detectors  in  biological  systems  have  spatially  limited  recep¬ 
tive  fields,  and  therefore  face  the  aperture  problem  whenever  an  extended,  oriented  pattern 
moves  across  their  receptive  fields.  Stimulated  by  a  theoretical  analysis  of  this  problem, 
Movshon  et  ai  (1985)  sought  and  found  direct  physiological  evidence  for  a  two-stage  mo¬ 
tion  measurement  computation  in  the  primate  visual  system.  Two  visual  areas  that  include 
an  abundance  of  motion -sensitive  neurons  are  cortical  areas  VI  and  MT.  MT  is  the  middle 
temporal  area  of  extrastriate  cortex,  located  in  the  posterior  bank  of  the  superior  tempo¬ 
ral  sulcus  (Van  Essen  and  Maunsell,  1983).  Movshon  et  ai  (1985)  explored  the  type  of 
motion  analysis  taking  place  in  area  MT,  using  visual  stimuli  that  consist  of  superimposed 
sinewave  gratings,  with  different  orientations  and  directions  of  motion.  The  results  of  these 
experiments  indicate  that  the  selectivity  of  neurons  in  area  VI  for  direction  of  movement  is 
such  that  they  could  provide  only  the  component  of  motion  in  the  direction  perpendicular 
to  the  orientation  of  image  features.  Area  MT,  however,  contains  a  subpopulation  of  cells, 
referred  to  as  pattern  cells,  that  appear  to  respond  to  the  2-D  direction  of  motion  of  a  com¬ 
bined  grating  pattern,  independent  of  its  individual  components.  These  neurons  may  serve 
to  combine  motion  components  to  compute  the  real  2-D  direction  of  velocity  of  a  moving 
pattern.  These  experiments  do  not  yet  distinguish  between  the  use  of  the  simple  assump¬ 
tion  of  pure  translation,  as  suggested  in  the  study  (Movshon  et  a/.,  1985),  versus  a  more 
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general  assumption  such  as  smoothness.  Stimulus  patterns  undergoing  more  complicated 
motions  are  required  to  make  such  a  distinction.  If  the  pattern  cells  in  area  MT  embody  the 
assumption  of  smoothness  in  their  computation  of  motion,  one  would  expect  to  find  direct 
interaction  between  pattern  cells  that  analyze  nearby  areas  of  the  visual  field. 

The  recovery  of  3-D  structure  from  motion 

When  an  object  moves  in  space,  the  motions  of  individual  points  on  the  object  differ 
in  a  way  that  conveys  information  about  its  3-D  structure.  Using  2-D  shadow  projections 
of  3-D  wireframe  objects,  such  as  that  shown  in  Figure  8a,  Wallach  and  O’Connell  (1953) 
showed  that  the  human  visual  system  can  derive  the  correct  3-D  structure  of  moving  objects 
from  their  changing  2-D  projection  alone.  Other  perceptual  studies  also  demonstrated  this 
remarkable  ability  (for  example,  Braunstein,  1976;  Johansson,  1975;  Rogers  and  Graham, 
1979;  Ullman,  1979).  Relative  motion  in  the  image  is  also  created  by  movement  of  the 
observer  relative  to  the  environment,  and  can  be  used  to  infer  observer  motion  from  the 
changing  image  (for  example,  Gibson,  1950;  Johansson,  1971;  Lee,  1980;  Regan,  Kaufman 
and  Lincoln,  1986). 

Theoretically,  the  two  problems  of  (1)  recovering  the  3-D  structure  and  movement 
of  objects  in  the  environment  and  (2)  recovering  the  3-D  motion  of  the  observer  from  the 
changing  image,  are  closely  related.  The  main  difficulty  faced  by  both  is  that  infinitely  many 
combinations  of«3-D  structure  and  motion  could  give  rise  to  any  particular  2-D  image.  To 
resolve  this  inherent  ambiguity,  it  is  necessary  to  impose  additional  constraint  that  allows 
most  3-D  interpretations  to  be  ruled  out,  leaving  one  that  is  most  plausible  from  a  physical 
standpoint.  Computational  studies  have  used  the  rigidity  assumption  to  derive  a  unique 
3-D  structure  and  motion;  they  assume  that  if  it  is  possible  to  interpret  the  changing  2- 
D  image  as  the  projection  of  a  rigid  3-D  object  in  motion,  then  such  an  interpretation 
should  be  chosen  (for  example,  Ullman,  1979,  1983;  Clocksin,  1980;  Prazdny,  1980,  1983; 
Longuet-Higgins  and  Prazdny,  1981;  Tsai  and  Huang,  1981;  Mitiche,  1986;  Waxman  and 
Ullman,  1985;  Waxman  and  Wohn,  1987).  When  the  rigidity  assumption  is  used  in  this 
way,  the  recovery  of  structure  from  motion  requires  the  computation  of  the  rigid  3-D  object 
that  would  project  onto  a  given  2-D  image.  The  rigidity  assumption  was  suggested  by 
perceptual  studies  that  described  a  tendency  for  the  human  visual  system  to  choose  a  rigid 
interpretation  of  moving  elements  (Wallach  and  O’Connell,  1953;  Gibson  and  Gibson,  1957; 
Jansson  and  Johansson,  1973;  Johansson,  1975,  1977). 

Computational  studies  have  shown  that  the  rigidity  assumption  can  be  used  to  derive 
a  unique  3-D  structure  from  the  changing  2-D  image.  Furthermore,  this  unique  3-D  in¬ 
terpretation  can  be  derived  by  integrating  image  information  only  over  a  limited  extent  in 
space  and  in  time.  For  example,  suppose  that  a  rigid  object  in  motion  is  projected  onto 
the  image  plane  by  using  orthographic  projection.  Three  distinct  views  of  four  points  on 
the  moving  object  are  sufficient  to  compute  a  unique  rigid  3-D  structure  for  the  points 
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(Ullman,  1979).  In  general,  if  only  two  views  of  the  moving  points  are  considered  or  fewer 
points  are  observed,  there  are  multiple  rigid  3-D  structures  consistent  with  the  changing 
2-D  projection.  Theoretical  results  regarding  the  recovery  of  a  unique  3-D  structure  under 
a  variety  of  conditions  are  summarized  in  Ullman  (1983)  and  Hildreth  and  Koch  (1987). 
These  theoretical  results  are  important  for  the  study  of  the  recovery  of  structure  from  mo¬ 
tion  in  biological  vision  systems,  for  two  reasons.  First,  they  show  that  by  using  the  rigidity 
assumption,  a  unique  structure  can  be  recovered  from  motion  information  alone.  It  is  not 
necessary  to  make  further  physical  assumptions,  in  order  to  obtain  a  unique  solution.  Sec¬ 
ond,  these  results  show  that  it  is  possible  to  recover  3-D  structure  by  integrating  image 
information  over  a  small  extent  in  space  and  in  time.  This  second  observation  could  bear 
on  the  neural  mechanisms  that  compute  structure  from  motion;  in  principle,  they  need  only 
integrate  motion  information  over  a  limited  area  of  the  visual  field  and  a  limited  extent  in 
time. 

Computational  studies  also  provide  algorithms  for  deriving  the  structure  of  moving 
objects.  Typically,  measurements  of  the  positions  or  velocities  of  image  features  give  rise  to 
a  set  of  mathematical  equations  whose  solution  represents  the  desired  3— D  structure.  The 
algorithms  generally  derive  this  structure  from  motion  information  extracted  over  a  limited 
area  of  the  image  and  a  limited  extent  in  time.  Testing  of  these  algorithms  reveals  that 
although  this  strategy  is  possible  in  theory,  it  is  not  reliable  in  practice.  A  small  amount 
of  error  in  the  image  measurements  can  lead  to  very  different  (and  often  incorrect)  3-D 
structures.  This  behavior  is  due  in  part  to  the  observation  that  over  a  small  extent  in  space 
and  time,  very  different  objects  can  induce  almost  identical  patterns  of  motion  in  the  image 
(Ullman,  1983,  1984a). 

This  sensitivity  to  error  inherent  in  algorithms  that  integrate  motion  information  only 
over  a  small  extent  in  space  and  time  suggests  that  a  robust  scheme  for  deriving  structure 
should  use  image  information  that  is  more  extended  in  space  or  time  or  both.  This  conclusion 
is  supported  in  recent  computational  studies  (for  example,  Bruss  and  Horn,  1983;  Lawton, 
1983;  Ullman,  1984a;  Adiv,  1985;  Bolles  and  Baker,  1985;  Negahdaripour  and  Horn,  1985; 
Yasumoto  and  Medioni,  1985;  Bharwani,  Riseman  and  Hanson,  1986;  Shariat  and  Price, 
1986;  Subbarao,  1986;  Waxman  and  Wohn,  1987),  which  show  that  consideration  of  motion 
information  that  is  more  extended  in  space  or  time  can  lead  to  a  stable  recovery  of  structure. 
The  extension  in  time  can  be  achieved  by  considering  a  large  number  of  discrete  frames  or 
by  observing  continuous  motion  over  a  significant  temporal  extent. 

With  regard  to  the  human  visual  system,  the  dependence  of  perceived  structure  on  the 
spatial  and  temporal  extent  of  the  viewed  motion  has  not  yet  been  studied  systematically, 
but  the  following  informal  observations  have  been  made.  Regarding  spatial  extent,  two  or 
three  points  undergoing  relative  motion  are  sufficient  to  elicit  a  perception  of  3-D  structure 
(Borjesson  and  von  Hofsten,  1973;  Johansson,  1975),  although  theoretically  the  recovery 
of  structure  is  less  constrained  for  two  points  in  motion,  and  perceptually  the  sensation  of 
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structure  is  weaker.  In  addition,  appropriate  visual  stimulation  in  a  small  region  of  the 
peripheral  visual  field  is  adequate  to  elicit  a  strong  sense  of  self-motion  (Johansson,  1971). 
Regarding  the  temporal  extent  of  viewed  motion,  Johansson  (1975)  showed  that  a  brief 
observation  of  patterns  of  moving  lights  generated  by  human  figures  moving  in  the  dark 
(commonly  referred  to  as  biological  motion  displays)  can  lead  to  a  perception  of  the  3— D 
motion  and  structure  of  the  figures.  Other  perceptual  studies  indicate  that  the  human  visual 
system  requires  an  extended  time  period  to  reach  an  accurate  perception  of  3-D  structure 
(Wallach  and  O’Connell,  1953;  White  and  Mueser,  1960;  Doner,  Lappin  and  Perfetto,  1984; 
Inada  et  al.,  1986).  A  brief  observation  of  a  moving  pattern  sometimes  yields  an  impression 
of  structure  that  is  “flatter”  than  the  true  structure  of  the  moving  object.  Thus,  the  human 
visual  system  is  capable  of  deriving  some  sense  of  structure  from  motion  information  that  is 
integrated  over  a  small  extent  in  space  and  time.  An  accurate  perception  of  structure  may, 
however,  require  a  more  extended  viewing  period. 

Most  methods  compute  a  3-D  structure  from  motion  only  when  the  changing  image 
can  be  interpreted  as  the  projection  of  a  rigid  object  in  motion.  They  otherwise  yield  no 
interpretation  of  structure  or  yield  a  solution  that  is  incorrect  or  unstable.  Algorithms 
that  are  exceptions  to  this  can  interpret  only  restricted  classes  of  nonrigid  motions  (for 
example,  Bennett  and  Hoffman,  1985;  Hoffman  and  Flinchbaugh,  1982;  Koenderink  and  van 
Doom,  1986;  Subbarao,  1986).  The  human  visual  system,  however,  can  derive  some  sense 
of  structure  for  a  wide  range  of  nonrigid  motions,  including  stretching,  bending  and  more 
complex  types  of  deformation  (Johansson,  1975;  Jansson  and  Johansson,  1973;  Todd,  1982, 
1984).  Furthermore,  displays  of  rigid  objects  in  motion  sometimes  give  rise  to  the  perception 
of  somewhat  distorting  objects  (Wallach,  Weisz  and  Adams,  1956;  White  and  Mueser,  1960; 
Braunstein  and  Andersen,  1984;  Hildreth,  1984;  Adelson,  1985).  These  observations  suggest 
that  while  the  human  visual  system  tends  to  choose  rigid  interpretations  of  a  changing 
image,  it  probably  does  not  use  the  rigidity  assumption  in  the  strict  way  that  previous 
computational  studies  suggest. 

Ullman  (1984a)  proposed  a  more  flexible  method  for  deriving  structure  from  motion  that 
interprets  both  rigid  and  nonrigid  motion.  Referred  to  as  the  incremental  rigidity  scheme, 
this  algorithm  uses  the  rigidity  assumption  in  a  different  way  from  previous  studies.  It 
maintains  an  internal  model  of  the  structure  of  a  moving  object  that  consists  of  the  estimated 
3-D  coordinates  of  points  on  the  object.  The  model  is  continually  updated  as  new  positions 
of  image  features  are  considered.  Initially,  the  object  is  assumed  to  be  flat,  if  no  other  cues 
to  3-D  structure  are  present.  Otherwise,  its  initial  structure  may  be  determined  by  other 
cues  available,  from  stereopsis,  shading,  texture,  or  perspective.  As  each  new  view  of  the 
moving  object  appears,  the  algorithm  computes  a  new  set  of  3-D  coordinates  for  points 
on  the  object  that  maximizes  the  rigidity  in  the  transformation  from  the  current  model 
to  the  new  positions.  This  is  achieved  by  minimizimg  the  change  in  the  3-D  distances 
between  points  in  the  model.  Thus  the  algorithm  interprets  the  changing  2-D  image  as 
the  projection  of  a  moving  3-D  object  that  changes  as  little  as  possible  from  one  moment 


27 


to  the  next.  Through  a  process  of  repeatedly  considering  new  views  of  objects  in  motion 
and  updating  the  current  model  of  their  structure,  the  algorithm  builds  up  and  maintains 
a  3-D  model  of  the  objects.  If  objects  deform  over  time,  the  3-D  model  computed  by  the 
algorithm  also  changes  over  time.  A  parallel  model  based  on  UUman’s  incremental  rigidity 
scheme  was  recently  proposed  by  Landy  (1986). 

Physiological  studies  have  uncovered  neurons  in  higher  cortical  areas  that  are  sensitive 
to  properties  of  the  motion  field  that  may  be  relevant  to  the  recovery  of  the  3-D  structure 
and  motion  of  surfaces  in  the  environment,  or  to  the  recovery  of  the  motion  of  the  observer 
relative  to  the  scene.  Many  studies  have  revealed  neurons  sensitive  to  uniform  expansion 
or  contraction  of  the  visual  field,  a  property  that  is  correlated  either  with  translation  of  the 
observer  forward  or  backward,  or  equivalently,  motion  of  an  object  toward  or  away  from  the 
observer.  Such  neurons  have  been  found,  for  example,  in  the  posterior  parietal  cortex  of  the 
monkey  (Motter  and  Mountcastle,  1981;  Andersen,  1987).  Other  neurons  have  been  found 
that  are  sensitive  to  global  rotations  in  the  visual  field  (see,  for  example,  Andersen,  1987; 
Sakata  et  ai,  1985).  All  of  these  neurons  have  large  receptive  fields,  so  they  probably  lack 
the  spatial  sensitivity  required  to  derive  the  detailed  shape  of  an  object  surface  from  relative 
motion.  A  recent  study  by  Siegel  and  Andersen  (1986)  shows  that  motion  processing  in  area 
MT  of  primate  visual  cortex  is  critical  to  the  recovery  of  structure  from  motion. 

3.  HIGH  LEVEL  VISION 

The  processes  we  have  examined  so  far  belong  to  the  realm  of  low  level  vision.  Their 
goal  is  primarily  to  recover  properties  of  the  surrounding  environment.  Such  processes  can 
be  evaluated  on  the  basis  of  their  validity,  that  is,  whether  the  results  they  deliver  (such 
as  depth,  surface  orientation,  etc.)  are  correct  and  accurate.  The  low  level  processes  were 
also  characterized  by  a  bottom-up  and  parallel  mode  of  processing.  “Bottom-up”  means 
that  the  processing  depends  on  the  visual  stimulus,  and  not  on  the  task  being  performed. 
“Parallel”  here  means  spatial  parallelism;  that  is,  the  same  operations  are  being  performed 
across  the  entire  visual  field  or  large  parts  of  it. 

In  high  level  vision,  the  descriptions  produced  by  ike  earlier  stages  are  used  for  tasks 
such  as  recognition,  visually  guided  manipulation,  locomotion,  and  navigation  through  the 
environment.  At  this  stage  the  emphasis  is  often  on  usefulness  rather  than  validity.  There 
are,  for  example,  many  possible  ways  to  describe  the  elements  in  a  given  image  for  the 
purpose  of  recognition,  but  some  may  be  more  useful  than  others.  The  processing  here  is 
goal  directed  (as  opposed  to  bottom-up),  and  spatially  focused  (as  opposed  to  spatially 
uniform). 

A  further  distinction  can  be  made  between  intermediate  and  high  level  vision.  High 
level  vision  uses  specific  knowledge  about  objects  in  the  world,  such  as  a  catalog  of  objects 
stored  in  long  term  memory,  while  intermediate  vision  does  not.  The  terms  “intermediate” 
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and  “high  level”  vision  do  not  mean  to  imply  a  sequential  order.  It  is  possible,  for  example, 
that  recognition  tasks  may  proceed  on  their  own  without  relying  on  the  prior  application 
of  intermediate  level  processes.  In  section  3.1  we  discuss  the  area  of  intermediate  vision, 
and  in  particular,  the  problem  of  extracting  shape  properties  and  spatial  relations  among 
objects  in  the  scene,  from  visual  information.  In  section  3.2  we  discuss  one  of  the  major 
problems  in  high  level  vision  —  visual  object  recognition. 

3.1  Intermediate  Vision  and  Visual  Routines 

The  extraction  of  shape  properties  and  spatial  relations 

The  use  of  visual  information  often  requires  the  extraction  of  shape  properties  of  con¬ 
tours  and  regions  and  the  analysis  of  spatial  relations  among  items  in  the  image.  This 
analysis  plays  a  role  in  visual  classification,  and  recognition.  An  example  is  shown  in  Figure 
9.  The  figure  on  the  left  is  recognized  effortlessly  and  immediately  as  representing  a  face, 
as  opposed  to  the  jumbled  face  figure  on  the  right.  In  this  example,  the  features  comprising 
the  face  are  highly  schematic.  The  eye  or  nose  by  themselves,  for  example,  are  crude  rep¬ 
resentations  of  a  real  eye  or  nose.  It  is  primarily  the  spatial  relations  among  the  schematic 
items  that  make  it  a  face  profile.  The  eye  is  inside  the  head  boundary,  the  nose  is  attached 
to  the  boundary,  the  eye  is  above  the  nose  and  below  the  eyebrow,  and  so  on.  Our  ability  to 
recognize  objects  easily  from  such  a  representation  demonstrates  that  we  can  naturally  and 
effortlessly  extract  in  a  single  glance  spatial  relations  such  as  inside,  above  or  attached-to, 
and  then  use  these  relations  in  recognition  tasks. 

There  is  some  evidence  that  the  ability  to  analyze  such  configurations  and  to  detect 
certain  basic  abstract  shape  properties  and  spatial  relations  already  exists  at  a  very  early 
age.  For  example,  some  studies  (Fantz,  1961)  have  suggested  that  infants  as  young  as  one  to 
five  weeks  of  age  already  make  a  distinction  between  “correct”  and  jumbled  schematic  faces. 
At  the  physiological  level,  neurons  were  found  in  the  STS  area  of  the  macaque  visual  cortex 
that  respond  specifically  to  faces  (Gross,  Rocha-Miranda  and  Bender,  1972).  According  to 
some  reports  (Perret,  Rolls  and  Caan,  1982),  these  cells  often  respond  to  schematic  line- 
drawn  faces,  but  not  to  scrambled  face  figures.  These  findings  suggest  the  existence  of 
mechanisms  that  respond  not  only  to  specific  and  realistic  faces,  but  to  any  configuration 
that  has  the  general  overall  shape  and  correct  spatial  relations  among  its  parts. 

The  visual  analysis  of  shape  and  spatial  relations  also  plays  an  important  role  in  plan¬ 
ning  actions  and  manipulating  objects.  During  such  activities,  we  often  use  vision  to  answer 
such  questions  as  “can  object  A  fit  into  the  space  between  objects  B  and  C?”  as  shown  in 
Figure  10.  These  questions  may  not  be  posed  explicitly,  but  information  about  such  re¬ 
lations  is  obtained  routinely  in  our  daily  activities.  Problems  of  this  type  do  not  require 
recognition,  because  they  do  not  depend  on  naming  the  objects  or  on  past  familiarity  with 
them.  They  do,  however,  require  the  spatial  analysis  of  shape  and  relations  among  objects. 
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Figure  9,  Recognizing  faces.  The  figure  on  the  left  is  perceived  immediately  as  a  face  profile.  The 
figure  on  the  right  contains  similar  parts  in  a  different  spatial  arrangement. 


In  this  sense,  this  analysis  belongs  more  to  the  level  termed  above  “intermediate  vision” 
than  to  high  level  vision. 

Properties  of  intermediate  level  vision:  non-uniformity ,  open-endedness  and  task  depen¬ 
dence 

The  computations  involved  in  extracting  shape  and  spatial  relations  axe  also  clearly  not 
part  of  the  early  visual  processes,  but  must  operate  at  a  later,  separate  stage.  As  far  as  we 
know,  the  operations  involved  in  the  construction  of  the  early  visual  representations  (such 
as  edge  detection,  the  computation  of  depth,  color,  motion  analysis,  and  so  on),  proceed 
uniformly  and  in  parallel  across  the  entire  visual  field  or  a  large  part  of  it.  In  contrast,  it 
is  not  feasible  to  have  at  every  location  an  “inside/ out  side  detector,”  in  the  same  way  that 
orientation  of  contours  is  recovered  by  a  dense  array  of  specialized  orientation  detectors. 
More  generally,  it  is  not  feasible  to  extract  uniformly  and  in  a  bottom-up  manner  all  the 
possibly  relevant  shape  properties  and  spatial  relations  in  the  image. 

The  extraction  of  shape  and  spatial  relations  is  also  more  open-ended  than  the  earlier 
visual  processes.  That  is,  different  tasks  may  require  the  extraction  of  different  shape 
properties  and  relations.  There  does  not  seem  to  be  a  clear  bound  on  the  number  of 
properties  and  relations  that  can  be  extracted;  new  ones  can  be  learned  as  required  by  the 
task.  As  a  result  of  these  properties,  the  computations  involved  in  intermediate  level  vision 
cannot  be  entirely  bottom-up,  but  must  be  task  dependent.  That  is,  in  viewing  the  same 


Figure  10.  Spatial  reasoning.  It  is  possible  to  obtain,  from  visual  information,  answers  to  questions 
such  as,  can  the  black  disk  be  moved  to  the  location  of  the  cross  without  colliding  with^any  of  the 
objects  in  the  figure? 


scene  on  two  different  occasions,  the  processes  applied  to  the  scene  at  this  intermediate  level 
of  processing  may  be  different,  depending  on  the  visual  task  being  accomplished. 

The  complexity  of  extracting  shape  properties  and  spatial  relations 

The  perception  of  many  shape  properties  and  relations  appears  to  us  subjectively  to  be 
immediate  and  effortless.  This  apparent  immediateness  is,  however,  deceiving;  the  compu¬ 
tations  required  for  extracting  shape  and  spatial  relations  are  often  quite  complicated.  In 
fact,  in  many  cases  the  efficient  analysis  of  shape  and  spatial  relations  is  still  beyond  the 
reach  of  current  computer  vision  systems. 

A  simple  illustrative  example  is  the  perception  of  inside/outside  relations,  that  is, 
whether  a  given  location  is  inside  or  outside  a  closed  curve.  When  the  curve  is  not  too 
convoluted,  this  relation  is  easy  to  extract  by  “merely  looking”  at  the  image.  In  Figure  11, 
the  location  marked  “x”  is  clearly  inside  the  curve  in  Figure  11a,  and  outside  the  curve  in 
Figure  lib.  How  do  we  reach  this  conclusion  so  effortlessly? 

In  computer  vision,  a  popular  method  of  computing  inside/outside  relations  is  the  so- 
called  “ray  intersection”  method.  A  ray  is  drawn  from  the  location  in  question  to  the  edge 


Figure  11.  Judging  inside/outside.  The  fact  that  the  x  lies  inside  the  closed  figure  in  (a)  and 
outside  in  (b)  is  established  effortlessly,  by  “merely  looking”  at  the  figure. 

of  the  image,  and  the  intersections  made  by  the  ray  with  the  curve  are  counted.  If  the 
number  is  even,  the  point  is  outside  the  curve,  otherwise  it  must  be  inside.  This  procedure 
is  simple,  but  clearly  insufficient.  Consider,  for  instance,  the  examples  in  Figure  12.  In 
Figure  12a,  the  number  of  ray  intersections  is  even,  but  the  point  lies  inside  the  curve. 
This  is  because  the  ray  intersection  method  makes  an  implicit  assumption  that  the  curve 
in  question  is  isolated  in  the  visual  field,  an  assumption  that  is  violated  in  this  example.  In 
Figure  12b,  the  number  of  intersections  is  odd,  therefore  the  ray  intersection  method  would 
conclude  that  the  point  lies  inside  the  curve.  In  this  case,  the  error  arises  because  the  ray 
intersection  method  incorporates  the  assumption  that  the  curve  is  closed.  Figure  12c  is  a 
slight  variation  on  the  inside/outside  problem.  The  task  is  to  determine  whether  any  of  the 
points  lie  inside  the  closed  curve.  The  ray  intersection  method  would  require  applying  the 
test  to  all  of  the  individual  dots.  What  our  visual  system  seems  to  do  instead  is  somehow  to 
inspect  the  figure  and  verify  that  none  of  the  dots  lie  in  its  interior.  Finally,  in  Figure  12d, 
the  task  is  to  find  an  “innermost”  location  (that  is,  a  point  that  lies  inside  all  three  curves). 
This  task  is  visually  straightforward,  but  it  cannot  be  solved  by  any  simple  application  of 
the  ray  intersection  method. 

The  conclusion  is  that  the  method  used  by  our  own  visual  system  is  more  powerful  and 
flexible  than  the  ray  intersection  method  or  any  other  method  proposed  so  far  in  compu- 
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Figure  12.  Deficiencies  of  the  ray  intersection  method,  (a)  The  number  of  intersections  is  even, 
but  the  point  lies  inside  the  curve,  (b)  The  number  is  odd,  but  the  point  is  not  inside  a  closed 
region,  (c)  The  task  is  to  determine  whether  any  of  the  points  lies  inside  the  figure,  (d)  The  task 
is  to  find  a  location  that  is  internal  to  all  three  curves. 


tational  vision.  More  generally,  we  see  that  the  computation  of  seemingly  simple  spatial 
relations  is  surprisingly  difficult.  In  addition,  the  problem  is  complicated  by  the  facts  that 
intermediate  level  vision  requires  the  capacity  to  extract  efficiently  an  essentially  unbounded 
set  of  properties  and  relations,  and  that  the  computations  performed  must  depend  on  the 
particular  visual  task  at  hand. 

Visual  routines 

A  possible  approach  to  this  problem  is  that  the  perception  of  shape  properties  and 
spatial  relations  is  achieved  by  the  application  of  visual  routines  to  the  early  visual  repre¬ 
sentation  (Ullman,  1984b).  These  visual  routines  are  sequences  of  some  basic  operations 
that  are  “wired  into'  the  visual  system.  Routines  for  different  properties  and  relations  are 
composed  from  the  same  set  of  basic  operations,  using  different  combinations  for  different 
tasks.  Using  a  fixed  set  of  basic  operations,  the  visual  system  can  assemble  different  rou¬ 
tines,  and  in  this  manner  extract  a  large  variety  of  different  shape  properties  and  relations. 
The  basic  operations  themselves  are  carried  out  by  specialized  and  highly  efficient  mech¬ 
anisms.  To  understand  the  perception  of  spatial  relations  in  general,  an  important  step 
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would  be  to  identify  the  set  of  basic  operations.  To  explain  how  we  perceive  a  particular 
relation  such  as  “above”  or  “inside”  would  require,  in  this  framework,  a  specification  of  the 
particular  routine  used  for  establishing  the  relation  in  terms  of  the  underlying  operations. 
In  the  following  section,  we  describe  briefly  three  basic  operations  that  appear  to  play  a 
useful  role  in  visual  routines:  shifting  and  indexing,  coloring,  and  boundary  tracing. 

Shifting  and  indexing 

The  apphcation  of  visual  routines  implies  a  spatial  and  temporal  structuring  of  the 
processing.  A  certain  operation  is  applied  at  a  given  location  in  the  visual  field,  then 
(perhaps  depending  on  the  results  of  this  application)  another  operation  is  applied  at  a 
new  location.  “Shifting”  is  the  general  operation  of  moving  the  focus  of  processing  to  a 
required  location.  “Indexing”  is  a  particular  type  of  shifting,  the  shifting  of  the  processing 
focus  to  a  salient  location  in  the  visual  field.  There  is  a  considerable  body  of  experimental 
evidence  related  to  the  redirecting  of  the  processing  focus,  the  so-called  “spotlight”  of  visual 
attention.  This  body  of  evidence  is  reviewed,  for  example,  by  Posner  (1980),  and  Treisman 
and  Gelade  (1980). 

\  simple  and  elegant  demonstration  that  appears  to  employ  shifting  and  indexing  comes 
from  Treisman’s  experiments  with  the  conjunction  of  elementary  properties  (see  also  Beck 
and  Ambler,  1973).  Her  experiments  have  shown  that  certain  odd-man-out  targets  can  be 
detected  in  a  field  of  distractors  in  a  parallel  manner:  the  time  to  detect  the  target  does 
not  depend  significantly  on  the  number  of  distractors  in  the  background.  For  example,  a 
target  blue  X  can  be  detected  in  constant  time  in  a  field  of  brown  T’s  and  green  X’s  (up 
to  thirty  background  elements  were  used  in  these  experiments)  on  the  basis  of  its  unique 
color.  Similarly,  a  target  T  can  be  detected  in  a  field  of  X’s  in  constant  time,  presumably 
on  the  basis  of  orientation  differences.  In  contrast,  certain  combinations  (conjunctions)  of 
elementary  properties  require  sequential  search.  For  example,  the  target  may  be  a  green  T 
in  a  field  of  brown  T’s  and  green  X’s.  The  target  in  this  case  matches  half  the  distractors 
in  shape  and  half  in  color.  The  evidence  suggests  that  in  this  case  the  processing  becomes 
quite  different:  the  visual  system  appears  to  scan  the  items  sequentially  in  search  of  the 
required  conjunction.  The  scan  is  performed  to  a  large  extent  internally,  rather  than  by  eye 
movements. 

The  scan  used  in  the  conjunction  search  is  a  psychophysical  example  of  the  use  of  a 
shifting  operation.  In  the  parallel  search  situation,  the  odd-man-out  element  is  immediately 
indexable,  that  is,  the  processing  focus  can  shift  immediately  to  it  without  dwelling  on  other 
elements.  This  is,  therefore,  an  example  of  an  indexing  operation.  Recently,  phenomena 
related  to  shifting  and  selective  visual  attention  have  also  been  studied  at  the  physiological 
level  (see,  for  example,  the  study  by  Moran  and  Desimone  (1985)  regarding  location-specific 
responses  in  area  V4  of  the  alert  macaque  monkey). 
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Region  coloring  and  boundary  tracing 

Coloring  and  tracing  are  two  somewhat  similar  operations.  “Coloring”  is  the  operation 
of  labeling  a  region  with  a  unique  label.  Such  labeling  can  be  achieved  by  the  bounded 
activation  of  a  region  in  the  following  maimer.  Starting  from  a  given  point,  the  area  around 
it  in  the  internal  representation  is  somehow  activated.  This  activation  spreads  outward 
until  a  boundary  is  reached,  but  it  is  not  allowed  to  cross  the  boundary.  If  the  initial 
point  lies  inside  a  closed  boundary,  the  entire  region  will  be  “colored,”  or  labeled.  This 
labeling  can  provide,  for  example,  a  basis  for  separating  “inside”  from  “outside”  in  the 
tasks  considered  in  Figures  11  and  12.  These  tasks  also  often  involve  an  indexing  operation: 
the  x  is  indexed  first,  and  this  location  serves  as  the  starting  point  for  the  coloring  operation. 
Additional  stages  would  be  required  in  all  of  these  examples  to  complete  the  inside/outside 
computations;  however,  these  additions  will  not  be  considered  further  here. 

Boundary  tracing  is  a  similar  operation  applied  to  one-dimensional  entities  in  the  image, 
such  as  contours  and  boundaries.  Starting  from  a  given  location  on  a  contour,  the  contour 
is  traced  sequentially  in  a  given  direction  and  labeled.  From  a  computational  standpoint, 
such  an  operation  could  serve  a  useful  role  in  the  analysis  of  contours  and  boundaries,  by 
separating  them  from  nearby  contours.  For  a  general  discussion  of  coloring  and  boundary 
tracing,  see  Ullman  (1984b),  and  for  a  psychophysical  exploration  of  boundary  tracing,  see 
Jolicoeur,  Mackey  and  Ullman  (1986). 

We  have  considered  above  only  a  small  sample  of  the  problems  in  intermediate  level 
vision.  On  the  whole  work  in  this  area  is  only  in  its  infancy,  and  many  questions  remain 
unanswered.  The  approach  discussed  above  is  only  one  of  severed  possible  alternatives  and 
has  so  far  been  primarily  motivated  by  computational  considerations,  and  little  is  known 
at  present  about  the  processes  by  which  the  human  visual  system  extracts  abstract  shape 
properties  and  spatial  relations  in  the  image. 

3.2  Visual  Object  Recognition 

The  common  view  of  object  recognition  is  that  it  requires  some  sort  of  matching  between 
the  internal  representations  of  an  object  stored  in  memory,  and  a  similar  representation 
generated  from  an  object  in  the  image.  The  process  of  object  recognition  is  therefore 
different  from  processes  of  intermediate  and  low  level  vision  in  that  it  is  more  intimately 
related  to  the  problems  of  memory  organization,  retrieval,  expectations,  and  reasoning. 

In  this  section  we  consider  the  problem  of  recognizing  objects  visually  on  the  basis  of 
their  shape.  Shape-based  recognition  is  only  part  of  the  problem  of  visual  object  recog¬ 
nition,  as  other  “paths”  to  recognition  besides  shape  are  possible.  Objects  sure  sometimes 
recognized  more  on  the  basis  of  their  color  or  texture  than  on  the  basis  of  their  shape.  In 
other  instances,  visual  recognition  requires  certain  reasoning  processes  rather  than  the  iden¬ 
tification  of  a  specific  shape.  We  will  also  not  consider  here  the  problem  often  referred  to 
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as  the  segmentation  problem,  which  has  to  do  with  the  delineation  of  the  object  of  interest 
in  an  image.  Finally,  we  will  consider  primarily  the  recognition  of  individual  objects,  rather 
than  object  classification. 

Why  is  object  recognition  difficult?  According  to  one  approach  (termed  the  “direct” 
approach  (Abu-Mostafa  and  Psaltis,  1987)),  the  problem  is  primarily  one  of  memory  size 
and  efficient  parallel  matching.  To  recognize  objects,  we  have  to  store  a  large  number  of 
object  views  in  memory.  In  attempting  to  recognize  an  object,  we  must  compare  the  object 
currently  in  view  with  previously  stored  views  of  objects  and  select  the  one  that  resembles 
it  most.  Models  of  human  associative  memory  (Willshaw,  Buneman  and  Longuet-Higgins, 
1969;  Hopfield,  1982)  have  suggested  that  brain  mechanisms  can  cope  efficiently  with  both 
of  these  problems.  In  large  scale  neural  networks,  a  large  number  of  patterns  can  be  stored, 
and  a  new  pattern  can  be  compared  simultaneously  with  all  previously  stored  patterns. 

There  are,  however,  at  least  two  problems  with  the  direct  approach.  First,  the  set  of  all 
possible  views  of  all  possible  objects  is  likely  to  be  prohibitively  large,  and  at  the  same  time 
redundant.  To  recognize,  for  instance,  triangles  of  any  shape,  position,  and  orientation,  it 
clearly  is  not  necessary  to  store  in  memory  a  large  number  of  representative  shapes.  Second, 
it  should  be  noted  that  the  comparison  used  in  such  schemes  between  the  current  image 
and  the  images  stored  in  memory  is  a  simple  one,  essentially  a  correlation  of  the  viewed 
image  with  previously  stored  images.  In  contrast,  we  can  recognize  objects  from  novel  views 
that  may  not  be  closely  similar  ta  any  previously  stored  view,  by  this  simple  comparison 
measure. 

The  alternatives  to  the  direct  approach  assume  that  we  do  not  store  and  compare  the 
input  patterns  per  se.  Instead,  we  produce  from  the  image  some  sort  of  internal  representa¬ 
tion  of  the  viewed  object,  and  compare  it  with  similar  descriptions  stored  in  memory.  The 
hope  is  that  some  of  the  variations  in  the  different  appearances  of  an  object  will  be  handled 
by  the  description  process.  That  is,  an  object  seen  under  diJerent  viewing  conditions,  e.g. 
from  different  viewpoints,  will  still  produce  highly  similar  descriptions.  Most  of  the  schemes 
that  have  been  proposed  to  date  can  be  classified  into  three  maun  approaches:  the  first  based 
on  invariant  properties,  the  second  on  object  decomposition  into  parts,  and  the  third  on  a 
process  of  explicitly  compensating  for  the  transformation  between  the  viewed  object  and 
the  stored  model. 

The  invariant  properties  approach 

This  approach  to  object  recognition  assumes  that  objects  have  certain  invariant  proper¬ 
ties  that  are  common  to  aill  of  their  views.  Formally,  a  property  of  this  type  can  be  thought 
of  as  a  function  from  object  images  to  the  domain  of  real  numbers.  For  example,  various 
applications  have  used  the  notion  of  a  “compactness  measure,”  defined  as  the  ratio  of  the 
object’s  perimeter  length  to  the  square  root  of  its  apparent  area.  Round  and  compact  ob¬ 
jects  will  have  a  low  score  on  this  measure,  while  thin  and  elongated  objects  have  a  high 


score.  Furthermore,  the  measure  is  unaffected  by  rotation,  translation,  and  scaling  in  the 
image  plane.  The  idea  is  to  define  a  number  of  such  measures,  and  collectively  they  would 
serve  to  identify  each  object  unambiguously.  The  overall  recognition  process  is  thus  broken 
down  into  the  extraction  of  a  number  of  different  properties,  followed  by  a  final  decision 
stage  based  on  these  properties. 

In  a  popular  variant  of  this  approach,  a  property  is  not  expected  to  be  entirely  invariant 
for  the  different  views  of  an  object  (or  class  of  objects),  but  to  be  restricted  to  a  small 
interval.  Properties  of  different  objects  may  have  partially  overlapping  ranges,  but  it  is 
hoped  that  by  defining  a  number  of  different  properties  it  will  be  possible  to  characterize 
each  object  uniquely.  This  leads  naturally  to  the  notion  of  “feature  spaces”  used  extensively 
in  pattern  recognition.  If  n  different  properties  are  used,  each  object  can  be  represented  as 
a  point  in  an  n-dimensional  space.  Recognition  and  classification  then  become  problems  of 
clustering  points  in  such  spaces. 

Another  approach  that  belongs  to  this  general  class  of  invariant  properties  theories 
is  Gibson’s  theory  of  high-order  invariances  (Gibson,  1950,  1979).  Gibson  suggested  that 
invariant  properties  of  objects  may  be  discovered  in  “high  order”  invariances  in  the  optic  ar¬ 
ray.  Such  invariances  were  assumed  to  be  based  primarily  on  spatial  and  temporal  gradients 
of  texture  density. 

The  invariant  properties  approach  has  been  applied  with  some  success  to  limited  prob¬ 
lems,  such  as  the  recognition  of  simple  industrial  parts  under  controlled  viewing  conditions. 
The  approach  does  not  generalize  well,  however,  to  more  complicated  domains.  The  weak¬ 
ness  of  the  approach  seems  to  be  that  it  is  difficult,  and  often  impossible,  to  find  relatively 
simple  properties  that  are  preserved  across  the  different  transformations  that  an  object  may 
undergo. 

Object  decomposition  into  parts 

A  second  approach,  which  replaced  the  previous  one  as  the  main  approach  to  object 
recognition,  relies  on  the  decomposition  of  objects  into  constituent  parts.  The  approach 
appeals  to  the  intuition  that  many  objects  seem  to  have  natural  parts.  For  example,  the 
human  body  can  be  divided  at  some  level  into  a  head,  torso,  arms,  and  legs.  In  the  de¬ 
composition  approach,  these  parts  axe  detected,  and  the  recognition  of  the  entire  object 
proceeds  on  the  basis  of  the  constituent  parts  and  their  relationships. 

There  are  a  number  of  subclasses  within  this  general  approach,  the  most  popular  of 
which  has  been  the  “structural  description”  method.  This  approach  assumes  that  it  would 
be  easier  to  capture  object  invariances  after  their  parts  have  been  identified.  In  particular, 
simple  relations  between  pa  ,3  will  remain  invariant  across  all  object  views.  For  example, 
in  the  character  “T”  one  can  define  two  main  parts  and  certain  relations  (such  as  “join”) 
that  are  common  to  ail  legal  instances  of  this  character. 
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An  early  example  of  a  theory  of  this  type  applied  to  human  vision  is  Milner’s  (1974) 
model  of  visual  shape  recognition.  This  theory  deals  primarily  with  2-D  shapes,  and  uses 
as  basic  parts  primitive  features  such  as  edges  and  line  segments.  Marr  and  Nishihara 
(1978)  proposed  a  structural  description  theory  applied  to  3-D  objects.  They  used  as 
primitive  parts  elongated  volumes  called  “generalized  cylinders.”  Recognition  proceeds  by 
constructing  a  description  of  the  object  in  terms  of  these  cylinders  and  their  relationships. 
The  description  produced  in  this  manner  is  “object  centered,”  in  the  sense  that  the  final 
description  relies  on  the  internal  relations  among  parts,  and  not  on  their  relations  relative 
to  the  viewer.  Additional  recent  theories  of  this  class  applied  to  human  vision  include 
Biederman’s  (1985)  theory  of  recognition  by  components,  and  Hoffman  and  Richards’  (1986) 
“codon”  scheme  for  the  description  and  recognition  of  image  contours. 

Although  the  use  of  object  parts  seems  to  have  some  clear  merits,  i:  also  appears 
that  the  use  of  structural  descriptions  as  proposed  in  most  of  these  schemes  has  a  number 
of  limitations.  One  is  that  the  decomposition  into  natural  parts  is  often  insufficient  to 
characterize  the  object  in  question.  In  many  cases  precise  shape,  rather  than  the  general 
arrangement  of  parts,  appears  to  be  important.  Another  problem  that  appears  to  limit 
the  applicability  of  the  approach  is  that  many  objects  do  not  decompose  naturally  into  the 
union  of  clearly  distinct  parts. 

The  alignment  approach 

The  idea  behind  the  alignment  approach  is  to  detect  and  explicitly  compensate  for 
the  transformations  between  an  object  currently  in  view  and  its  model  stored  in  memory. 
This  concept  has  been  applied  in  the  past  only  in  simple  domains,  such  as  printed  character 
recognition  of  a  known  font  type  (Neisser,  1967,  chapter  3).  In  this  application  it  is  assumed 
that  a  character  in  view  may  differ  from  its  ideal  prototype  stored  in  memory  because  it 
may  rotate,  translate,  and  change  scale.  The  first  step  in  the  recognition  process  is  therefore 
to  compensate  for  these  transformations.  This  compensation  can  be  done  prior  to  the 
identification  of  the  letter  in  question.  For  example,  displacement  can  be  compensated  for 
by  computing  the  “center  of  mass”  of  the  letter  and  then  shifting  the  letter  so  that  this  point 
always  coincides  with  a  fixed  location.  Size  and  orientation  can  also  be  “normalized”  during 
this  first  stage.  Following  the  normalization,  the  remaining  differences  between  the  viewed 
letter  and  its  stored  model  are  small,  and  a  straightforward  comparison  should  suffice  to 
identify  the  letter  correctly. 

This  general  approach  has  been  revived  recently  and  extended  in  an  attempt  to  apply  it 
to  the  recognition  of  complex  3-D  objects.  The  general  idea  remains  the  same.  The  viewed 
object  differs  from  its  stored  model,  because  it  has  undergone  a  certain  transformation  such 
as  a  change  in  position  and  3-D  rotation  (in  the  case  of  rigid  objects).  In  an  alignment 
approach,  the  first  stage  is  to  detect  this  transformation  prior  to  the  identification  of  the 
object.  The  transformation  can  then  be  “undone”;  this  stage  is  called  the  alignment  stage. 
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Following  the  alignment,  the  stored  model  and  viewed  object  are  in  close  agreement,  and 
the  correct  model  is  therefore  easier  to  determine. 

Compared  to  translation,  rotation  and  scaling  in  the  image,  the  transformations  that 
the  image  of  a  3-D  object  can  undergo  are  more  complicated  to  determine  and  compensate 
for.  A  number  of  recent  schemes  (for  example,  Lowe,  1986;  Ullman,  1986;  Huttenlocher  and 
Ullman,  1987)  have  been  proposed  for  extending  the  alignment  method  to  deal  with  natural 
objects.  The  approach  appears  to  offer  some  advantages,  but  it  has  been  demonstrated  so 
far  only  in  limited  domains. 

It  must  be  concluded  that  in  object  recognition,  which  is  one  of  the  most  fundamental 
aspects  of  human  vision,  theories  (as  well  as  experimental  work)  still  have  a  long  way  to  go. 
Some  combination  of  an  alignment  process  with  part  decomposition  may  offer  a  promising 
starting  point,  but  these  techniques  must  be  extended  and  generalized  considerably  before 
they  will  be  able  to  cope  successfully  with  common  objects  that  are  recognized  efficiently 
by  the  human  visual  system. 
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